Temporally Consistent Label Interpolation for Robust Surgical Multi-Task Learning under Challenging Conditions

Garam Kim; Juyoun Park

arxiv: 2606.26634 · v1 · pith:TSEHCK23new · submitted 2026-06-25 · 💻 cs.CV

Pith reviewed 2026-06-26 05:02 UTC · model grok-4.3

classification 💻 cs.CV

keywords: surgical scene understanding · multi-task learning · label interpolation · optical flow · zero-shot segmentation · temporal consistency · instrument segmentation · phase recognition

The pith

A flow-guided framework generates dense pseudo labels from sparse surgical keyframes to balance multi-task learning across temporal and spatial tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies a core mismatch where temporal tasks like phase recognition receive dense frame-level labels while spatial tasks like instrument segmentation are annotated only on sparse keyframes. It introduces FAROS to interpolate these sparse labels into temporally consistent dense pseudo labels by combining zero-shot segmentation mask propagation with optical flow estimation. The resulting labels feed into a single Transformer-based model that jointly optimizes phase recognition, step recognition, anticipation, instrument segmentation, and action recognition. Experiments on GraSP, MISAW, and AutoLaparo show gains in cross-task representation learning, with additional validation on DAVIS 2017 confirming the interpolation works beyond surgery.

Core claim

FAROS generates temporally consistent dense pseudo labels from sparse keyframe annotations by combining zero-shot segmentation-based mask propagation with optical flow estimation; these labels are then integrated into a unified Transformer-based multi-task framework that jointly learns surgical phase recognition, step recognition, anticipation, instrument segmentation, and action recognition, enabling balanced optimization between dense temporal supervision and sparse spatial supervision.

What carries the argument

FAROS, a flow-guided label interpolation framework that merges zero-shot segmentation mask propagation with optical flow estimation to produce consistent dense pseudo labels.
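
The Figure 3 caption describes the mechanism as flow-guided consistency checking that detects a propagation failure and triggers reprompting. The idea can be pictured with a small sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names `warp_mask`, `mask_iou`, and `needs_reprompt`, the backward-flow convention, the nearest-neighbour warping, and the 0.5 IoU threshold are all hypothetical choices that do not come from the paper.

```python
import numpy as np


def warp_mask(prev_mask: np.ndarray, backward_flow: np.ndarray) -> np.ndarray:
    """Warp a binary mask from the previous frame into the current frame.

    Assumed convention: backward_flow[y, x] = (dx, dy) points from pixel
    (x, y) in the current frame to its source location in the previous frame.
    """
    h, w = prev_mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Nearest-neighbour lookup of the source pixel for every target pixel.
    src_x = np.rint(xs + backward_flow[..., 0]).astype(int)
    src_y = np.rint(ys + backward_flow[..., 1]).astype(int)
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    warped = np.zeros((h, w), dtype=bool)
    # Pixels whose source falls outside the image stay background.
    warped[valid] = prev_mask[src_y[valid], src_x[valid]].astype(bool)
    return warped


def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection over union of two binary masks (1.0 if both are empty)."""
    a = a.astype(bool)
    b = b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0
    return float(np.logical_and(a, b).sum() / union)


def needs_reprompt(prev_mask, propagated_mask, backward_flow,
                   iou_threshold: float = 0.5) -> bool:
    """Flag a frame when the propagated mask disagrees with the flow-warped mask.

    The threshold is a hypothetical value chosen for illustration only.
    """
    warped = warp_mask(prev_mask, backward_flow)
    return bool(mask_iou(warped, propagated_mask) < iou_threshold)
```

In a full pipeline the flow field would come from an optical flow estimator and the propagated mask from the segmentation model; a low agreement score would then be the signal to issue a corrective prompt.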

If this is right

The unified Transformer model achieves higher performance on all five tasks simultaneously on GraSP, MISAW, and AutoLaparo.
Cross-task representation learning improves because dense temporal supervision now aligns with dense spatial supervision.
Label interpolation quality holds on the non-surgical DAVIS 2017 benchmark under a sparse ground-truth protocol.
Joint optimization becomes feasible without separate handling of annotation density differences.
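
The sparse ground-truth protocol behind the DAVIS 2017 claim (the Figure 6 caption states that ground truth is provided every 30 frames) can be sketched as a keyframe split followed by scoring on the held-out intermediate frames. This is a minimal sketch under assumptions: `split_keyframes` and `mean_iou` are hypothetical helper names, and plain IoU is used as a stand-in because this page does not state the paper's exact evaluation metrics.

```python
import numpy as np


def split_keyframes(num_frames: int, interval: int = 30):
    """Return (keyframe indices, intermediate frame indices).

    Keyframes are the frames whose ground truth is given to the
    interpolation method; all other frames are held out for evaluation.
    """
    keyframes = list(range(0, num_frames, interval))
    key_set = set(keyframes)
    intermediate = [i for i in range(num_frames) if i not in key_set]
    return keyframes, intermediate


def mean_iou(pred_masks, gt_masks, frame_ids) -> float:
    """Average IoU between pseudo labels and ground truth on the given frames.

    frame_ids is expected to be non-empty.
    """
    scores = []
    for i in frame_ids:
        pred = np.asarray(pred_masks[i], dtype=bool)
        gt = np.asarray(gt_masks[i], dtype=bool)
        union = np.logical_or(pred, gt).sum()
        # Two empty masks count as perfect agreement.
        scores.append(1.0 if union == 0 else np.logical_and(pred, gt).sum() / union)
    return float(np.mean(scores))
```

Scoring only the intermediate frames keeps the keyframes, which the method sees as input, out of the quality estimate.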

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could lower labeling costs for new surgical datasets by requiring annotations only on keyframes.
Similar interpolation might apply to other video domains such as autonomous driving or sports analysis where spatial labels are expensive.
If the propagation step is made faster, the method could support online surgical assistance systems.
The framework highlights a general pattern: using motion cues to densify supervision when tasks have mismatched annotation densities.

Load-bearing premise

Combining zero-shot segmentation-based mask propagation with optical flow estimation reliably overcomes the limits of appearance-based methods under occlusion, smoke, and motion blur.

What would settle it

A test set of surgical videos with heavy smoke or motion blur where the generated dense pseudo labels show lower accuracy than a ground-truth dense annotation baseline when measured by segmentation or action label metrics.

Figures

Figures reproduced from arXiv: 2606.26634 by Garam Kim, Juyoun Park.

**Figure 1.** Illustration of the annotation granularity mismatch and the motivation for label interpolation. (Left) Frame-level workflow tasks benefit from dense temporal annotations across all frames, whereas pixel-level instrument tasks are sparsely annotated only at selected keyframes. Label interpolation bridges this supervision gap by generating dense pseudo-annotations for all intermediate frames. (Right) Represe…

**Figure 2.** Overview of the proposed FAROS pipeline. (Top) The SAM2-based propagation module processes input frames through an image encoder, memory attention, and mask decoder, with a prompt encoder providing corrective spatial prompts. Propagated mask features are stored in a memory bank via the memory encoder for temporally coherent propagation. (Bottom) The flow-guided module operates between keyframes 𝐾0 and 𝐾1,…

**Figure 3.** Qualitative comparison of mask propagation on an instrument exit case. SAM2-only propagation fails to track instrument disappearance and re-entry, producing temporally inconsistent masks. FAROS detects the propagation failure via flow-guided consistency checking and recovers accurate segmentation through reprompting.

**Figure 4.** Qualitative comparison of mask propagation under illumination variation and blood-induced occlusion. SAM2-only propagation degrades under these appearance disruptions, whereas FAROS maintains robust segmentation by leveraging geometric motion priors to compensate for appearance-driven memory attention failures.

Bidirectional SAM2 Prompting. We process each keyframe segment (𝑘0, 𝑘1) independently. Naively…

**Figure 5.** Overview of the proposed multi-task learning framework. (1. Instrument Segmentation Baseline) A Mask2Former-based RPN is trained on ground-truth and pseudo masks (weighted by 𝑤) to produce segment embeddings for mask prediction and class scoring. (2. All-task Finetuning) The frozen RPN provides precomputed region embeddings. An MViT backbone with task-dedicated CLS tokens extracts spatio-temporal features …

**Figure 6.** Qualitative comparison of mask propagation results on the DAVIS 2017 validation set under the sparse ground-truth interpolation protocol (ground truth provided every 30 frames). FAROS maintains temporally consistent segmentation across diverse challenging scenarios including fast-moving objects, partial occlusion, and large inter-frame appearance changes, while SAM2 baseline propagation exhibits mask drift…

**Figure 7.** Visualization of instrument segmentation results for comparison on the GraSP dataset.

**Figure 8.** Visualization of instrument segmentation results for comparison on the MISAW dataset.

… also recovers and surpasses the single-task baseline across all metrics, further validating that flow-guided label interpolation effectively resolves annotation imbalance and enables balanced cross-task optimization. Qualitative results in …

**Figure 9.** Visualization of instrument segmentation results for comparison on the AutoLaparo dataset.

**Figure 10.** Co-occurrence matrix between instrument and step in the GraSP dataset. Each cell indicates the proportion of frames in which a given instrument appears during the corresponding step, normalized per instrument row. The Large Needle Driver exhibits strong co-occurrence with multiple suture-related step categories, while the Clip Applier is predominantly associated with the Clip Pedicles step.

**Figure 11.** Per-class step recognition AP change (ΔAP = w/ inter − w/o inter) (left) and instrument segmentation IoU change (ΔIoU = w/ inter − w/o inter) (right) on GraSP.

… keep pace with the surgical video stream. On the workflow recognition tasks, our framework attains a mean per-clip latency of 22.72 ± 5.49 ms, corresponding to a throughput of 44.02 clips/s at 32-bit precision on a single GPU, comfortably exceeding t…

**Figure 12.** Scatter plot of instrument ΔIoU versus linked step ΔAP on GraSP. Each bubble represents an instrument–step pair with strong co-occurrence, and bubble size is proportional to the magnitude of instrument IoU change.

… this, we propose FAROS, a flow-guided label interpolation framework that combines promptable segmentation-based mask propagation with optical flow estimation to generate temporally consistent d…

**Figure 13.** Co-occurrence matrix between instrument and step categories in the MISAW dataset. The Needle instrument shows strong association with the Suture Making and Needle Holding categories, reflecting its functional role in suturing-phase procedures.

**Figure 15.** Scatter plot of instrument ΔIoU versus linked step ΔAP on MISAW. Bubble size reflects the magnitude of instrument IoU change. The positive correlation confirms that instruments benefiting most from interpolation propagate performance gains to their semantically associated step categories, consistent with findings on GraSP.

**Figure 16.** Qualitative comparison of mask propagation on the DAVIS 2017 drone subset (sparse-GT, interval = 30). Blue and red masks denote the person and drone objects, respectively. As the drone enters the scene, SAM2 baseline fails to detect and track it, producing missing masks throughout the re-entry sequence. FAROS successfully recovers accurate segmentation of both objects via flow-guided reprompting, maintain…
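
The co-occurrence matrices in Figures 10 and 13 are described as per-instrument row-normalized proportions of frames. A minimal sketch of that computation follows, assuming per-frame instrument presence flags and one step label per frame; the function name `cooccurrence_matrix` and the input layout are hypothetical and are not taken from the paper.

```python
import numpy as np


def cooccurrence_matrix(instrument_present, step_ids, num_steps: int) -> np.ndarray:
    """Row-normalized instrument-by-step co-occurrence.

    instrument_present: (num_frames, num_instruments) array of 0/1 flags.
    step_ids: (num_frames,) integer step label for each frame.
    Returns an array of shape (num_instruments, num_steps) in which row i
    holds the share of instrument i's frames that fall in each step.
    """
    present = np.asarray(instrument_present, dtype=float)
    step_onehot = np.eye(num_steps)[np.asarray(step_ids)]
    # counts[i, s] = number of frames where instrument i is visible in step s.
    counts = present.T @ step_onehot
    row_sums = counts.sum(axis=1, keepdims=True)
    # Instruments that never appear keep an all-zero row instead of NaN.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
```

Normalizing by row means each row sums to one for any instrument that appears at least once, so the values compare steps within an instrument rather than instruments against each other.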

Original abstract

Effective multi-task learning for surgical scene understanding is fundamentally hindered by annotation granularity mismatch; temporal workflow tasks such as phase recognition, step recognition and anticipation benefit from dense frame-level supervision, whereas pixel-level spatial tasks including instrument segmentation and action recognition are only sparsely annotated on selected keyframes due to prohibitive labeling costs. This supervision imbalance undermines shared representation learning and limits joint optimization across heterogeneous surgical tasks. To address this, we propose Flow-guided Annotation for Robust Operating Scenes (FAROS), a flow-guided label interpolation framework that combines zero-shot segmentation-based mask propagation with optical flow estimation to overcome the limitations of appearance-based propagation under challenging surgical conditions such as occlusion, smoke, and motion blur, generating temporally consistent dense pseudo labels from sparse keyframe annotations. The densified instrument masks and action labels are integrated into a unified Transformer-based multi-task framework that jointly learns surgical phase recognition, step recognition, anticipation, instrument segmentation, and action recognition, enabling balanced optimization between dense temporal supervision and sparse spatial supervision. The label interpolation quality of FAROS is first validated on the DAVIS 2017 benchmark under a sparse ground-truth protocol, confirming robust propagation beyond the surgical domain. Extensive experiments on GraSP, MISAW, and AutoLaparo benchmarks further demonstrate that FAROS significantly improves cross-task representation learning and enhances holistic surgical scene understanding performance across spatio-temporal tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

FAROS tries to fix sparse labels in surgical videos via zero-shot masks plus optical flow, but the abstract shows no numbers, so the gains stay unproven.

The main thing to know is that this paper targets a real practical problem: surgical multi-task setups need dense labels for temporal tasks like phase recognition but only get sparse keyframes for segmentation and action labels. FAROS propagates those sparse labels using zero-shot segmentation masks combined with optical flow to create temporally consistent pseudo-labels, then feeds them into a single Transformer for joint training on phase, step, anticipation, segmentation, and action recognition.

What the work does is identify the supervision imbalance clearly and test the interpolation step on DAVIS 2017 under a sparse ground-truth protocol. That choice makes sense as a way to check generality before the surgical benchmarks. The idea of mixing zero-shot propagation with flow to handle smoke, occlusion, and blur is a reasonable extension of existing label-densification techniques to the OR domain.

The soft spots sit in the missing evidence. The abstract claims significant improvements on GraSP, MISAW, and AutoLaparo but gives no numbers, no error bars, and no ablation that isolates the contribution of the interpolated labels versus the joint optimization. The stress-test note is on target here: label quality is only shown on DAVIS, not on the surgical videos under the actual challenging conditions. Without direct metrics on pseudo-label accuracy for the target domain, it is hard to know whether the reported end-task gains come from better labels or from the unified model itself. The central assumption that the combination reliably beats appearance-based methods under surgical noise therefore stays untested in the visible text.

This paper is for groups already working on surgical scene understanding or label-efficient video multi-task learning. A reader who needs concrete methods for handling annotation mismatch would find the framework description useful once the full experiments are available.

It deserves peer review because the problem is well-posed and the method is described in enough detail to evaluate, even if the current claims need stronger backing on label quality and ablations.

Referee Report

3 major / 1 minor

Summary. The paper proposes FAROS (Flow-guided Annotation for Robust Operating Scenes), a framework that combines zero-shot segmentation-based mask propagation with optical flow estimation to generate temporally consistent dense pseudo labels from sparse keyframe annotations. These densified labels are integrated into a unified Transformer-based multi-task framework for joint learning of phase recognition, step recognition, anticipation, instrument segmentation, and action recognition. The interpolation quality is validated on DAVIS 2017 under a sparse ground-truth protocol, with claims of significant performance improvements on the GraSP, MISAW, and AutoLaparo surgical benchmarks under challenging conditions.

Significance. If the central claims hold, the work addresses a practical annotation imbalance in surgical scene understanding and could improve cross-task representation learning by providing dense supervision for temporal tasks while leveraging sparse spatial annotations. The cross-domain validation on DAVIS 2017 and the focus on robustness to occlusion, smoke, and motion blur represent potential strengths for generalizability.

major comments (3)

Abstract: The abstract asserts that FAROS 'significantly improves cross-task representation learning and enhances holistic surgical scene understanding performance across spatio-temporal tasks' on GraSP, MISAW, and AutoLaparo, yet provides no quantitative results, error bars, ablation details, or specific metrics to support these claims, preventing any assessment of the magnitude or reliability of the reported gains.
Abstract: The core claim that the zero-shot segmentation + optical flow combination produces accurate, temporally consistent dense pseudo labels specifically under surgical challenges (occlusion, smoke, motion blur) is not supported by direct label-quality metrics or ablations on the target surgical datasets; validation is restricted to DAVIS 2017, leaving open the possibility that any end-task gains arise from the unified Transformer or joint optimization rather than the interpolation mechanism.
Abstract and methods description: No details are given on the unified Transformer architecture, the joint loss formulation, training protocol, or ablation studies isolating the contribution of the FAROS-generated labels versus baseline multi-task learning, which are load-bearing for attributing improvements to the proposed interpolation.

minor comments (1)

The expansion of the FAROS acronym appears only after its first use; including it in the title or abstract opening sentence would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments highlighting areas where the abstract could better support the claims. We will revise the abstract to incorporate quantitative results and clarify the validation approach. We address each major comment below.

Referee: Abstract: The abstract asserts that FAROS 'significantly improves cross-task representation learning and enhances holistic surgical scene understanding performance across spatio-temporal tasks' on GraSP, MISAW, and AutoLaparo, yet provides no quantitative results, error bars, ablation details, or specific metrics to support these claims, preventing any assessment of the magnitude or reliability of the reported gains.

Authors: We agree that including key quantitative metrics in the abstract would strengthen the summary. In the revision, we will add specific results such as accuracy improvements on phase recognition (e.g., +X% on GraSP) and mIoU gains on instrument segmentation, along with references to ablation studies, while keeping the abstract concise. revision: yes
Referee: Abstract: The core claim that the zero-shot segmentation + optical flow combination produces accurate, temporally consistent dense pseudo labels specifically under surgical challenges (occlusion, smoke, motion blur) is not supported by direct label-quality metrics or ablations on the target surgical datasets; validation is restricted to DAVIS 2017, leaving open the possibility that any end-task gains arise from the unified Transformer or joint optimization rather than the interpolation mechanism.

Authors: The DAVIS 2017 evaluation under a sparse annotation protocol establishes the interpolation robustness in challenging conditions analogous to surgery. On surgical benchmarks, the contribution of FAROS labels is isolated via ablations comparing against baseline multi-task learning without densified labels. We will revise the abstract to explicitly note that downstream task gains on GraSP/MISAW/AutoLaparo serve as the primary validation for surgical applicability. revision: partial
Referee: Abstract and methods description: No details are given on the unified Transformer architecture, the joint loss formulation, training protocol, or ablation studies isolating the contribution of the FAROS-generated labels versus baseline multi-task learning, which are load-bearing for attributing improvements to the proposed interpolation.

Authors: Section 3 details the shared Transformer encoder with task-specific decoders, the joint loss as a weighted combination of cross-entropy (temporal tasks) and segmentation losses, the training protocol (AdamW optimizer, specific schedules, augmentations), and Section 4.3 presents ablations isolating FAROS label contributions. We will update the abstract with a brief reference to these elements for clarity. revision: yes
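
The joint loss mentioned in the last response (a weighted combination of cross-entropy for temporal tasks and segmentation losses) and the pseudo-mask weight 𝑤 from the Figure 5 caption can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the function names, binary cross-entropy as the mask loss, the per-frame weighting scheme, and the `lambda_seg` and `pseudo_weight` values are hypothetical and are not confirmed by the paper.

```python
import numpy as np


def cross_entropy(logits, labels) -> float:
    """Mean softmax cross-entropy for a frame-level task such as phase recognition."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


def weighted_mask_bce(pred_probs, target_masks, is_pseudo,
                      pseudo_weight: float = 0.5, eps: float = 1e-7) -> float:
    """Per-frame binary cross-entropy, with pseudo-labelled frames down-weighted.

    pseudo_weight plays the role of the weight w applied to pseudo masks;
    its value here is a placeholder.
    """
    p = np.clip(np.asarray(pred_probs, dtype=float), eps, 1.0 - eps)
    t = np.asarray(target_masks, dtype=float)
    bce = -(t * np.log(p) + (1.0 - t) * np.log(1.0 - p))
    per_frame = bce.reshape(len(bce), -1).mean(axis=1)
    weights = np.where(np.asarray(is_pseudo, dtype=bool), pseudo_weight, 1.0)
    return float((weights * per_frame).sum() / weights.sum())


def joint_loss(phase_logits, phase_labels, pred_probs, target_masks, is_pseudo,
               lambda_seg: float = 1.0, pseudo_weight: float = 0.5) -> float:
    """Temporal cross-entropy plus a weighted segmentation term."""
    temporal = cross_entropy(phase_logits, phase_labels)
    spatial = weighted_mask_bce(pred_probs, target_masks, is_pseudo, pseudo_weight)
    return temporal + lambda_seg * spatial
```

Down-weighting pseudo-labelled frames is one common way to limit the influence of interpolation errors; whether the paper weights per frame, per mask, or per pixel is not stated on this page.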

Circularity Check

0 steps flagged

No circularity; framework is self-contained

The paper introduces FAROS as a new combination of zero-shot segmentation-based mask propagation and optical flow estimation to densify sparse keyframe labels, then integrates the results into a unified Transformer multi-task model. Validation occurs on DAVIS 2017 under a sparse ground-truth protocol, with end-task metrics reported on GraSP/MISAW/AutoLaparo. No equations, fitted parameters, self-definitional reductions, or load-bearing self-citations appear in the text; the method is presented as an independent engineering solution to annotation imbalance rather than deriving from or renaming prior fitted quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Ledger constructed from abstract only; no explicit free parameters, axioms, or invented entities beyond the proposed framework name are stated.

axioms (1)

domain assumption: Optical flow and zero-shot segmentation remain reliable under surgical occlusions, smoke, and motion blur.
Invoked as the reason the method succeeds where appearance-based propagation fails.

invented entities (1)

FAROS framework (no independent evidence)
purpose: Flow-guided label interpolation to produce dense pseudo labels
Newly proposed system; no independent evidence outside the paper is mentioned.


2019
[44]

The 2017 davis challenge on video object segmentation

Pont-Tuset,J.,Perazzi,F.,Caelles,S.,Arbeláez,P.,Sorkine-Hornung, A., Van Gool, L., 2017. The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675

Pith/arXiv arXiv 2017
[45]

Robust in- stancetrackingviauncertaintyflow.arXivpreprintarXiv:2010.04367

Qian, J., Nan, J., Ancha, S., Okorn, B., Held, D., 2020. Robust in- stancetrackingviauncertaintyflow.arXivpreprintarXiv:2010.04367

arXiv 2020
[46]

Weakly supervised temporal convolutional networks for fine-grained surgical activity recognition

Ramesh,S.,Dall’Alba,D.,Gonzalez,C.,Yu,T.,Mascagni,P.,Mutter, D., Marescaux, J., Fiorini, P., Padoy, N., 2023. Weakly supervised temporal convolutional networks for fine-grained surgical activity recognition. IEEE Transactions on Medical Imaging 42, 2592–2602

2023
[47]

Sam 2: Segmentanythinginimagesandvideos,in:InternationalConference on Learning Representations, pp

Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al., 2025. Sam 2: Segmentanythinginimagesandvideos,in:InternationalConference on Learning Representations, pp. 28085–28128

2025
[48]

Unsupervised learningofopticalflowwithpatchconsistencyandocclusionestima- tion

Ren, Z., Yan, J., Yang, X., Yuille, A., Zha, H., 2020. Unsupervised learningofopticalflowwithpatchconsistencyandocclusionestima- tion. Pattern Recognition 103, 107191

2020
[49]

Rivoir, D., Bodenstedt, S., Funke, I., von Bechtolsheim, F., Distler, M., Weitz, J., Speidel, S., 2020. Rethinking anticipation tasks: Uncertainty-aware anticipation of sparse surgical instrument usage for context-aware assistance, in: International conference on medical image computing and computer-assisted intervention, Springer. pp. 752–762

2020
[50]

U-net: Convolutional networks for biomedical image segmentation, in: International Con- ferenceonMedicalimagecomputingandcomputer-assistedinterven- tion, Springer

Ronneberger, O., Fischer, P., Brox, T., 2015. U-net: Convolutional networks for biomedical image segmentation, in: International Con- ferenceonMedicalimagecomputingandcomputer-assistedinterven- tion, Springer. pp. 234–241

2015
[51]

An overview of multi-task learning in deep neural networks

Ruder, S., 2017. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098

Pith/arXiv arXiv 2017
[52]

Evaluation of extra pixel interpolation with maskprocessingformedicalimagesegmentationwithdeeplearning

Rukundo, O., 2024. Evaluation of extra pixel interpolation with maskprocessingformedicalimagesegmentationwithdeeplearning. Signal, Image and Video Processing 18, 7703–7710

2024
[53]

Robotic surgery

Schreuder, H., Verheijen, R., 2009. Robotic surgery. BJOG: An International Journal of Obstetrics & Gynaecology 116, 198–213

2009
[54]

Fun-sis: A fully unsupervised approach for surgical instrument seg- mentation

Sestini, L., Rosa, B., De Momi, E., Ferrigno, G., Padoy, N., 2023. Fun-sis: A fully unsupervised approach for surgical instrument seg- mentation. Medical Image Analysis 85, 102751

2023
[55]

Hierarchical image saliency detection on extended cssd

Shi, J., Yan, Q., Xu, L., Jia, J., 2015. Hierarchical image saliency detection on extended cssd. IEEE transactions on pattern analysis and machine intelligence 38, 717–729

2015
[56]

Semi-supervisedlearning withprogressiveunlabeleddataexcavationforlabel-efficientsurgical workflow recognition

Shi,X.,Jin,Y.,Dou,Q.,Heng,P.A.,2021. Semi-supervisedlearning withprogressiveunlabeleddataexcavationforlabel-efficientsurgical workflow recognition. Medical Image Analysis 73, 102158. Page 16 of 17 Temporally Consistent Label Interpolation for Robust Surgical Multi-Task Learning under Challenging Conditions

2021
[57]

Auto- matic instrument segmentation in robot-assisted surgery using deep learning, in: 2018 17th IEEE international conference on machine learning and applications (ICMLA), IEEE

Shvets,A.A.,Rakhlin,A.,Kalinin,A.A.,Iglovikov,V.I.,2018. Auto- matic instrument segmentation in robot-assisted surgery using deep learning, in: 2018 17th IEEE international conference on machine learning and applications (ICMLA), IEEE. pp. 624–628

2018
[58]

Mean teachers are better role mod- els: Weight-averaged consistency targets improve semi-supervised deep learning results

Tarvainen, A., Valpola, H., 2017. Mean teachers are better role mod- els: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in neural information processing systems 30

2017
[59]

Raft:Recurrentall-pairsfieldtransformsfor optical flow, in: European conference on computer vision, Springer

Teed,Z.,Deng,J.,2020. Raft:Recurrentall-pairsfieldtransformsfor optical flow, in: European conference on computer vision, Springer. pp. 402–419

2020
[60]

Is learning the n-th thing any easier than learning the first? Advances in neural information processing systems 8

Thrun, S., 1995. Is learning the n-th thing any easier than learning the first? Advances in neural information processing systems 8

1995
[61]

Endonet: A deep architecture for recognition tasks on laparoscopic videos

Twinanda, A.P., Shehata, S., Mutter, D., Marescaux, J., De Mathelin, M., Padoy, N., 2016. Endonet: A deep architecture for recognition tasks on laparoscopic videos. IEEE transactions on medical imaging 36, 86–97

2016
[62]

Towards holistic surgical scene understanding, in: International con- ferenceonmedicalimagecomputingandcomputer-assistedinterven- tion, Springer

Valderrama, N., Ruiz Puentes, P., Hernández, I., Ayobi, N., Verlyck, M., Santander, J., Caicedo, J., Fernández, N., Arbeláez, P., 2022. Towards holistic surgical scene understanding, in: International con- ferenceonmedicalimagecomputingandcomputer-assistedinterven- tion, Springer. pp. 442–452

2022
[63]

Look before you match: Instance understanding matters in video object segmentation, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recog- nition, pp

Wang,J.,Chen,D.,Wu,Z.,Luo,C.,Tang,C.,Dai,X.,Zhao,Y.,Xie, Y., Yuan, L., Jiang, Y.G., 2023. Look before you match: Instance understanding matters in video object segmentation, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recog- nition, pp. 2268–2278

2023
[64]

Wang, Z., Lu, B., Long, Y., Zhong, F., Cheung, T.H., Dou, Q., Liu, Y., 2022. Autolaparo: A new dataset of integrated multi-tasks for image-guided surgical automation in laparoscopic hysterectomy, in: Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 486– 496

2022
[65]

Segmatch: semi-supervised surgical instrument segmentation

Wei, M., Budd, C., Garcia-Peraza-Herrera, L.C., Dorent, R., Shi, M., Vercauteren, T., 2025. Segmatch: semi-supervised surgical instrument segmentation. Scientific Reports 15, 14042

2025
[66]

Accflow: Backward accumulation for long- range optical flow, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp

Wu, G., Liu, X., Luo, K., Liu, X., Zheng, Q., Liu, S., Jiang, X., Zhai, G., Wang, W., 2023. Accflow: Backward accumulation for long- range optical flow, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12119–12128

2023
[67]

Appearance-based refinement for object-centric motion segmentation, in: European Conference on Computer Vision, Springer

Xie, J., Xie, W., Zisserman, A., 2024. Appearance-based refinement for object-centric motion segmentation, in: European Conference on Computer Vision, Springer. pp. 238–256

2024
[68]

Gmflow: Learning optical flow via global matching, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp

Xu, H., Zhang, J., Cai, J., Rezatofighi, H., Tao, D., 2022. Gmflow: Learning optical flow via global matching, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8121–8130

2022
[69]

Hard frame detection and online mapping for surgical phase recognition, in: International Conference on Medical ImageComputingandComputer-AssistedIntervention,Springer.pp

Yi, F., Jiang, T., 2019. Hard frame detection and online mapping for surgical phase recognition, in: International Conference on Medical ImageComputingandComputer-AssistedIntervention,Springer.pp. 449–457

2019
[70]

Memory- augmentedsam2fortraining-freesurgicalvideosegmentation,in:In- ternationalConferenceonMedicalImageComputingandComputer- Assisted Intervention, Springer

Yin, M., Wang, F., Ye, X., Meng, Y., Fu, Z., 2025. Memory- augmentedsam2fortraining-freesurgicalvideosegmentation,in:In- ternationalConferenceonMedicalImageComputingandComputer- Assisted Intervention, Springer. pp. 328–337

2025
[71]

Yu, J., Wang, A., Dong, W., Xu, M., Islam, M., Wang, J., Bai, L., Ren, H., 2025. Sam 2 in robotic surgery: An empirical evaluation for robustness and generalization in surgical video segmentation, in: International Workshop on Efficient Medical Artificial Intelligence, Springer. pp. 174–183

2025
[72]

Yu, Y., Zhao, Z., Jin, Y., Chen, G., Dou, Q., Heng, P.A., 2022. Pseudo-label guided cross-video pixel contrast for robotic surgical scene segmentation with limited annotations, in: 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE. pp. 10857–10864

2022
[73]

Semisam+: rethinking semi-supervised medical image segmentation in the era of foundation models

Zhang, Y., Lv, B., Xue, L., Zhang, W., Liu, Y., Fu, Y., Cheng, Y., Qi, Y., 2025. Semisam+: rethinking semi-supervised medical image segmentation in the era of foundation models. Medical Image Analysis , 103733

2025
[74]

Nasalseg: A dataset for automatic segmentationofnasalcavityandparanasalsinusesfrom3dctimages

Zhang, Y., Wang, J., Pan, T., Jiang, Q., Ge, J., Guo, X., Jiang, C., Lu, J., Zhang, J., Liu, X., et al., 2024. Nasalseg: A dataset for automatic segmentationofnasalcavityandparanasalsinusesfrom3dctimages. Scientific Data 11, 1329

2024
[75]

A survey on multi-task learning

Zhang, Y., Yang, Q., 2021. A survey on multi-task learning. IEEE transactions on knowledge and data engineering 34, 5586–5609

2021
[76]

Zhao, Z., Jin, Y., Gao, X., Dou, Q., Heng, P.A., 2020. Learn- ing motion flows for semi-supervised instrument segmentation from roboticsurgicalvideo,in:InternationalConferenceonMedicalImage Computing and Computer-Assisted Intervention, Springer. pp. 679– 689. Page 17 of 17

2020


[28] [28]

Concurrentsegmentationandlocaliza- tion for tracking of surgical instruments, in: International conference on medical image computing and computer-assisted intervention, Springer

Laina, I., Rieke, N., Rupprecht, C., Vizcaíno, J.P., Eslami, A., Tombari,F.,Navab,N.,2017. Concurrentsegmentationandlocaliza- tion for tracking of surgical instruments, in: International conference on medical image computing and computer-assisted intervention, Springer. pp. 664–672

2017

[29] [29]

Surgical process modelling: a review

Lalys, F., Jannin, P., 2014. Surgical process modelling: a review. International journal of computer assisted radiology and surgery 9, 495–511

2014

[30] [30]

Pseudo-label: The simple and efficient semi- supervised learning method for deep neural networks, in: Workshop on challenges in representation learning, ICML, Atlanta

Lee, D.H., et al., 2013. Pseudo-label: The simple and efficient semi- supervised learning method for deep neural networks, in: Workshop on challenges in representation learning, ICML, Atlanta. p. 896

2013

[31] [31]

Recurrent dynamicembeddingforvideoobjectsegmentation,in:Proceedingsof theIEEE/CVFConferenceonComputerVisionandPatternRecogni- tion, pp

Li,M.,Hu,L.,Xiong,Z.,Zhang,B.,Pan,P.,Liu,D.,2022. Recurrent dynamicembeddingforvideoobjectsegmentation,in:Proceedingsof theIEEE/CVFConferenceonComputerVisionandPatternRecogni- tion, pp. 1332–1341

2022

[32] [32]

Drift robust non-rigid optical flowenhancementforlongsequences

Li, W., Cosker, D., Brown, M., 2016. Drift robust non-rigid optical flowenhancementforlongsequences. JournalofIntelligent&Fuzzy Systems 31, 2583–2595

2016

[33] [33]

Deep learning for surgical workflow analysis: a survey of progresses, limitations, and trends

Li, Y., Zhao, Z., Li, R., Li, F., 2024. Deep learning for surgical workflow analysis: a survey of progresses, limitations, and trends. Artificial Intelligence Review 57, 291

2024

[34] [34]

Surgical sam 2: Real-time segment anything in surgical video by efficient frame pruning

Liu, H., Zhang, E., Wu, J., Hong, M., Jin, Y., 2024. Surgical sam 2: Real-time segment anything in surgical video by efficient frame pruning. arXiv preprint arXiv:2408.07931

arXiv 2024

[35] [35]

Liu,Z.,Lin,Y.,Cao,Y.,Hu,H.,Wei,Y.,Zhang,Z.,Lin,S.,Guo,B.,

[36] [36]

10012–10022

Swintransformer:Hierarchicalvisiontransformerusingshifted windows, in: Proceedings of the IEEE/CVF international conference on computer vision, pp. 10012–10022

[37] [37]

Semi-supervised medical image segmentation via uncertainty rectified pyramid consistency

Luo, X., Wang, G., Liao, W., Chen, J., Song, T., Chen, Y., Zhang, S., Metaxas, D.N., Zhang, S., 2022. Semi-supervised medical image segmentation via uncertainty rectified pyramid consistency. Medical Image Analysis 80, 102517

2022

[38] [38]

Maier-Hein,L.,Vedula,S.S.,Speidel,S.,Navab,N.,Kikinis,R.,Park, A., Eisenmann, M., Feussner, H., Forestier, G., Giannarou, S., et al.,

[39] [39]

Nature Biomedical Engineering 1, 691–696

Surgicaldatasciencefornext-generationinterventions. Nature Biomedical Engineering 1, 691–696

[40] [40]

Robotic surgery: applications, limitations, and impact on surgical education

Morris, B., 2005. Robotic surgery: applications, limitations, and impact on surgical education. Medscape General Medicine 7, 72

2005

[41] [41]

Joint-task regulariza- tion for partially labeled multi-task learning, in: Proceedings of the IEEE/CVFConferenceonComputerVisionandPatternRecognition, pp

Nishi, K., Kim, J., Li, W., Pfister, H., 2024. Joint-task regulariza- tion for partially labeled multi-task learning, in: Proceedings of the IEEE/CVFConferenceonComputerVisionandPatternRecognition, pp. 16152–16162

2024

[42] [42]

Nwoye, C.I., Gonzalez, C., Yu, T., Mascagni, P., Mutter, D., Marescaux, J., Padoy, N., 2020. Recognition of instrument-tissue interactions in endoscopic videos via action triplets, in: International conferenceonmedicalimagecomputingandcomputer-assistedinter- vention, Springer. pp. 364–374

2020

[43] [43]

Video object seg- mentationusingspace-timememorynetworks,in:Proceedingsofthe IEEE/CVF international conference on computer vision, pp

Oh, S.W., Lee, J.Y., Xu, N., Kim, S.J., 2019. Video object seg- mentationusingspace-timememorynetworks,in:Proceedingsofthe IEEE/CVF international conference on computer vision, pp. 9226– 9235

2019

[44] [44]

The 2017 davis challenge on video object segmentation

Pont-Tuset,J.,Perazzi,F.,Caelles,S.,Arbeláez,P.,Sorkine-Hornung, A., Van Gool, L., 2017. The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675

Pith/arXiv arXiv 2017

[45] [45]

Robust in- stancetrackingviauncertaintyflow.arXivpreprintarXiv:2010.04367

Qian, J., Nan, J., Ancha, S., Okorn, B., Held, D., 2020. Robust in- stancetrackingviauncertaintyflow.arXivpreprintarXiv:2010.04367

arXiv 2020

[46] [46]

Weakly supervised temporal convolutional networks for fine-grained surgical activity recognition

Ramesh,S.,Dall’Alba,D.,Gonzalez,C.,Yu,T.,Mascagni,P.,Mutter, D., Marescaux, J., Fiorini, P., Padoy, N., 2023. Weakly supervised temporal convolutional networks for fine-grained surgical activity recognition. IEEE Transactions on Medical Imaging 42, 2592–2602

2023

[47] [47]

Sam 2: Segmentanythinginimagesandvideos,in:InternationalConference on Learning Representations, pp

Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al., 2025. Sam 2: Segmentanythinginimagesandvideos,in:InternationalConference on Learning Representations, pp. 28085–28128

2025

[48] [48]

Unsupervised learningofopticalflowwithpatchconsistencyandocclusionestima- tion

Ren, Z., Yan, J., Yang, X., Yuille, A., Zha, H., 2020. Unsupervised learningofopticalflowwithpatchconsistencyandocclusionestima- tion. Pattern Recognition 103, 107191

2020

[49] [49]

Rivoir, D., Bodenstedt, S., Funke, I., von Bechtolsheim, F., Distler, M., Weitz, J., Speidel, S., 2020. Rethinking anticipation tasks: Uncertainty-aware anticipation of sparse surgical instrument usage for context-aware assistance, in: International conference on medical image computing and computer-assisted intervention, Springer. pp. 752–762

2020

[50] [50]

U-net: Convolutional networks for biomedical image segmentation, in: International Con- ferenceonMedicalimagecomputingandcomputer-assistedinterven- tion, Springer

Ronneberger, O., Fischer, P., Brox, T., 2015. U-net: Convolutional networks for biomedical image segmentation, in: International Con- ferenceonMedicalimagecomputingandcomputer-assistedinterven- tion, Springer. pp. 234–241

2015

[51] [51]

An overview of multi-task learning in deep neural networks

Ruder, S., 2017. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098

Pith/arXiv arXiv 2017

[52] [52]

Evaluation of extra pixel interpolation with maskprocessingformedicalimagesegmentationwithdeeplearning

Rukundo, O., 2024. Evaluation of extra pixel interpolation with maskprocessingformedicalimagesegmentationwithdeeplearning. Signal, Image and Video Processing 18, 7703–7710

2024

[53] [53]

Robotic surgery

Schreuder, H., Verheijen, R., 2009. Robotic surgery. BJOG: An International Journal of Obstetrics & Gynaecology 116, 198–213

2009

[54] [54]

Fun-sis: A fully unsupervised approach for surgical instrument seg- mentation

Sestini, L., Rosa, B., De Momi, E., Ferrigno, G., Padoy, N., 2023. Fun-sis: A fully unsupervised approach for surgical instrument seg- mentation. Medical Image Analysis 85, 102751

2023

[55] [55]

Hierarchical image saliency detection on extended cssd

Shi, J., Yan, Q., Xu, L., Jia, J., 2015. Hierarchical image saliency detection on extended cssd. IEEE transactions on pattern analysis and machine intelligence 38, 717–729

2015

[56] [56]

Semi-supervisedlearning withprogressiveunlabeleddataexcavationforlabel-efficientsurgical workflow recognition

Shi,X.,Jin,Y.,Dou,Q.,Heng,P.A.,2021. Semi-supervisedlearning withprogressiveunlabeleddataexcavationforlabel-efficientsurgical workflow recognition. Medical Image Analysis 73, 102158. Page 16 of 17 Temporally Consistent Label Interpolation for Robust Surgical Multi-Task Learning under Challenging Conditions

2021

[57] [57]

Auto- matic instrument segmentation in robot-assisted surgery using deep learning, in: 2018 17th IEEE international conference on machine learning and applications (ICMLA), IEEE

Shvets,A.A.,Rakhlin,A.,Kalinin,A.A.,Iglovikov,V.I.,2018. Auto- matic instrument segmentation in robot-assisted surgery using deep learning, in: 2018 17th IEEE international conference on machine learning and applications (ICMLA), IEEE. pp. 624–628

2018

[58] [58]

Mean teachers are better role mod- els: Weight-averaged consistency targets improve semi-supervised deep learning results

Tarvainen, A., Valpola, H., 2017. Mean teachers are better role mod- els: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in neural information processing systems 30

2017

[59] [59]

Raft:Recurrentall-pairsfieldtransformsfor optical flow, in: European conference on computer vision, Springer

Teed,Z.,Deng,J.,2020. Raft:Recurrentall-pairsfieldtransformsfor optical flow, in: European conference on computer vision, Springer. pp. 402–419

2020

[60] [60]

Is learning the n-th thing any easier than learning the first? Advances in neural information processing systems 8

Thrun, S., 1995. Is learning the n-th thing any easier than learning the first? Advances in neural information processing systems 8

1995

[61] [61]

Endonet: A deep architecture for recognition tasks on laparoscopic videos

Twinanda, A.P., Shehata, S., Mutter, D., Marescaux, J., De Mathelin, M., Padoy, N., 2016. Endonet: A deep architecture for recognition tasks on laparoscopic videos. IEEE transactions on medical imaging 36, 86–97

2016

[62] [62]

Towards holistic surgical scene understanding, in: International con- ferenceonmedicalimagecomputingandcomputer-assistedinterven- tion, Springer

Valderrama, N., Ruiz Puentes, P., Hernández, I., Ayobi, N., Verlyck, M., Santander, J., Caicedo, J., Fernández, N., Arbeláez, P., 2022. Towards holistic surgical scene understanding, in: International con- ferenceonmedicalimagecomputingandcomputer-assistedinterven- tion, Springer. pp. 442–452

2022

[63] [63]

Look before you match: Instance understanding matters in video object segmentation, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recog- nition, pp

Wang,J.,Chen,D.,Wu,Z.,Luo,C.,Tang,C.,Dai,X.,Zhao,Y.,Xie, Y., Yuan, L., Jiang, Y.G., 2023. Look before you match: Instance understanding matters in video object segmentation, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recog- nition, pp. 2268–2278

2023

[64] [64]

Wang, Z., Lu, B., Long, Y., Zhong, F., Cheung, T.H., Dou, Q., Liu, Y., 2022. Autolaparo: A new dataset of integrated multi-tasks for image-guided surgical automation in laparoscopic hysterectomy, in: Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 486– 496

[65]

Wei, M., Budd, C., Garcia-Peraza-Herrera, L.C., Dorent, R., Shi, M., Vercauteren, T., 2025. Segmatch: semi-supervised surgical instrument segmentation. Scientific Reports 15, 14042

[66]

Wu, G., Liu, X., Luo, K., Liu, X., Zheng, Q., Liu, S., Jiang, X., Zhai, G., Wang, W., 2023. Accflow: Backward accumulation for long-range optical flow, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12119–12128

[67]

Xie, J., Xie, W., Zisserman, A., 2024. Appearance-based refinement for object-centric motion segmentation, in: European Conference on Computer Vision, Springer. pp. 238–256

[68]

Xu, H., Zhang, J., Cai, J., Rezatofighi, H., Tao, D., 2022. Gmflow: Learning optical flow via global matching, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8121–8130

[69]

Yi, F., Jiang, T., 2019. Hard frame detection and online mapping for surgical phase recognition, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 449–457

[70]

Yin, M., Wang, F., Ye, X., Meng, Y., Fu, Z., 2025. Memory-augmented sam2 for training-free surgical video segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 328–337

[71]

Yu, J., Wang, A., Dong, W., Xu, M., Islam, M., Wang, J., Bai, L., Ren, H., 2025. Sam 2 in robotic surgery: An empirical evaluation for robustness and generalization in surgical video segmentation, in: International Workshop on Efficient Medical Artificial Intelligence, Springer. pp. 174–183

[72]

Yu, Y., Zhao, Z., Jin, Y., Chen, G., Dou, Q., Heng, P.A., 2022. Pseudo-label guided cross-video pixel contrast for robotic surgical scene segmentation with limited annotations, in: 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE. pp. 10857–10864

[73]

Zhang, Y., Lv, B., Xue, L., Zhang, W., Liu, Y., Fu, Y., Cheng, Y., Qi, Y., 2025. Semisam+: rethinking semi-supervised medical image segmentation in the era of foundation models. Medical Image Analysis, 103733

[74]

Zhang, Y., Wang, J., Pan, T., Jiang, Q., Ge, J., Guo, X., Jiang, C., Lu, J., Zhang, J., Liu, X., et al., 2024. Nasalseg: A dataset for automatic segmentation of nasal cavity and paranasal sinuses from 3D CT images. Scientific Data 11, 1329

[75]

Zhang, Y., Yang, Q., 2021. A survey on multi-task learning. IEEE transactions on knowledge and data engineering 34, 5586–5609

[76]

Zhao, Z., Jin, Y., Gao, X., Dou, Q., Heng, P.A., 2020. Learning motion flows for semi-supervised instrument segmentation from robotic surgical video, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 679–689.
