
arxiv: 2606.26634 · v1 · pith:TSEHCK23new · submitted 2026-06-25 · 💻 cs.CV

Temporally Consistent Label Interpolation for Robust Surgical Multi-Task Learning under Challenging Conditions

Pith reviewed 2026-06-26 05:02 UTC · model grok-4.3

classification 💻 cs.CV
keywords surgical scene understanding · multi-task learning · label interpolation · optical flow · zero-shot segmentation · temporal consistency · instrument segmentation · phase recognition

The pith

A flow-guided framework generates dense pseudo labels from sparse surgical keyframes to balance multi-task learning across temporal and spatial tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies a core mismatch where temporal tasks like phase recognition receive dense frame-level labels while spatial tasks like instrument segmentation are annotated only on sparse keyframes. It introduces FAROS to interpolate these sparse labels into temporally consistent dense pseudo labels by combining zero-shot segmentation mask propagation with optical flow estimation. The resulting labels feed into a single Transformer-based model that jointly optimizes phase recognition, step recognition, anticipation, instrument segmentation, and action recognition. Experiments on GraSP, MISAW, and AutoLaparo show gains in cross-task representation learning, with additional validation on DAVIS 2017 confirming the interpolation works beyond surgery.

Core claim

FAROS generates temporally consistent dense pseudo labels from sparse keyframe annotations by combining zero-shot segmentation-based mask propagation with optical flow estimation; these labels are then integrated into a unified Transformer-based multi-task framework that jointly learns surgical phase recognition, step recognition, anticipation, instrument segmentation, and action recognition, enabling balanced optimization between dense temporal supervision and sparse spatial supervision.

What carries the argument

FAROS, a flow-guided label interpolation framework that merges zero-shot segmentation mask propagation with optical flow estimation to produce consistent dense pseudo labels.
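The idea of checking an appearance-based propagated mask against motion can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes integer per-pixel flow, a simple forward warp, and an IoU threshold of 0.5, and the names `warp_mask`, `iou`, and `needs_reprompt` are hypothetical. It only shows how disagreement between a flow-warped mask and a propagated mask could flag a frame for reprompting, the behaviour the Figure 3 caption attributes to FAROS.

```python
from typing import List, Tuple

Mask = List[List[int]]               # binary mask, mask[y][x] in {0, 1}
Flow = List[List[Tuple[int, int]]]   # per-pixel (dx, dy); integer flow is a simplification


def warp_mask(mask: Mask, flow: Flow) -> Mask:
    """Forward-warp a binary mask: each foreground pixel moves by its flow vector."""
    h, w = len(mask), len(mask[0])
    warped = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                dx, dy = flow[y][x]
                nx, ny = x + dx, y + dy
                if 0 <= nx < w and 0 <= ny < h:  # pixels leaving the frame are dropped
                    warped[ny][nx] = 1
    return warped


def iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two binary masks (1.0 when both are empty)."""
    inter = sum(p & q for ra, rb in zip(a, b) for p, q in zip(ra, rb))
    union = sum(p | q for ra, rb in zip(a, b) for p, q in zip(ra, rb))
    return inter / union if union else 1.0


def needs_reprompt(prev_mask: Mask, flow: Flow, propagated: Mask,
                   threshold: float = 0.5) -> bool:
    """Flag a frame when the propagated mask disagrees with the flow-warped mask.

    The threshold value is an assumption chosen for illustration.
    """
    return iou(warp_mask(prev_mask, flow), propagated) < threshold


# Toy example: a 2x2 instrument patch moving one pixel to the right.
prev = [[0, 0, 0, 0],
        [1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0]]
flow = [[(1, 0)] * 4 for _ in range(4)]
lost = [[0] * 4 for _ in range(4)]   # propagation lost the instrument entirely
print(needs_reprompt(prev, flow, lost))
```

A real pipeline would use a dense sub-pixel flow estimator and a backward warp with interpolation; the forward integer warp here is chosen only to keep the sketch dependency-free.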

If this is right

  • The unified Transformer model achieves higher performance on all five tasks simultaneously on GraSP, MISAW, and AutoLaparo.
  • Cross-task representation learning improves because dense temporal supervision now aligns with dense spatial supervision.
  • Label interpolation quality holds on the non-surgical DAVIS 2017 benchmark under a sparse ground-truth protocol.
  • Joint optimization becomes feasible without separate handling of annotation density differences.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could lower labeling costs for new surgical datasets by requiring annotations only on keyframes.
  • Similar interpolation might apply to other video domains such as autonomous driving or sports analysis where spatial labels are expensive.
  • If the propagation step is made faster, the method could support online surgical assistance systems.
  • The framework highlights a general pattern: using motion cues to densify supervision when tasks have mismatched annotation densities.

Load-bearing premise

Combining zero-shot segmentation-based mask propagation with optical flow estimation reliably overcomes the limits of appearance-based methods under occlusion, smoke, and motion blur.

What would settle it

A test set of surgical videos with heavy smoke or motion blur where the generated dense pseudo labels show lower accuracy than a ground-truth dense annotation baseline when measured by segmentation or action label metrics.
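Such a test could be scored along the lines of the following sketch. It assumes the sparse ground-truth protocol mentioned in the Figure 6 caption (ground truth kept every `interval` frames), represents each mask as a set of pixel coordinates for simplicity, and uses mean IoU on the intermediate frames as one possible segmentation metric. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Set, Tuple

Pixels = Set[Tuple[int, int]]  # a mask stored as the set of its foreground (x, y) pixels


def keyframe_segments(num_frames: int, interval: int) -> List[Tuple[int, int]]:
    """Return (k0, k1) pairs of consecutive keyframe indices."""
    keys = list(range(0, num_frames, interval))
    return list(zip(keys[:-1], keys[1:]))


def mask_iou(a: Pixels, b: Pixels) -> float:
    """Intersection over union of two pixel sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def interpolation_score(pred: Dict[int, Pixels], gt: Dict[int, Pixels],
                        num_frames: int, interval: int) -> float:
    """Mean IoU over frames strictly between consecutive keyframes.

    Keyframes themselves are skipped because their labels are given,
    and frames after the last keyframe fall outside every segment.
    """
    scores = []
    for k0, k1 in keyframe_segments(num_frames, interval):
        for t in range(k0 + 1, k1):
            scores.append(mask_iou(pred[t], gt[t]))
    return sum(scores) / len(scores) if scores else float("nan")


# Toy example: 7 frames, keyframes every 3 frames -> segments (0, 3) and (3, 6).
gt = {t: {(0, 0), (1, 0)} for t in range(7)}
pred = dict(gt)
pred[5] = {(0, 0)}  # one interpolated frame misses half of the object
print(interpolation_score(pred, gt, num_frames=7, interval=3))
```

Running the same score separately on clips tagged with heavy smoke or motion blur, and on the remaining clips, would show whether pseudo-label quality degrades under exactly the conditions the premise concerns.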

Figures

Figures reproduced from arXiv: 2606.26634 by Garam Kim, Juyoun Park.

Figure 1
Illustration of the annotation granularity mismatch and the motivation for label interpolation. (Left) Frame-level workflow tasks benefit from dense temporal annotations across all frames, whereas pixel-level instrument tasks are sparsely annotated only at selected keyframes. Label interpolation bridges this supervision gap by generating dense pseudo-annotations for all intermediate frames. (Right) Represe…
Figure 2
Overview of the proposed FAROS pipeline. (Top) The SAM2-based propagation module processes input frames through an image encoder, memory attention, and mask decoder, with a prompt encoder providing corrective spatial prompts. Propagated mask features are stored in a memory bank via the memory encoder for temporally coherent propagation. (Bottom) The flow-guided module operates between keyframes 𝐾0 and 𝐾1,…
Figure 3
Qualitative comparison of mask propagation on an instrument exit case. SAM2-only propagation fails to track instrument disappearance and re-entry, producing temporally inconsistent masks. FAROS detects the propagation failure via flow-guided consistency checking and recovers accurate segmentation through reprompting.
Figure 4
Qualitative comparison of mask propagation under illumination variation and blood-induced occlusion. SAM2-only propagation degrades under these appearance disruptions, whereas FAROS maintains robust segmentation by leveraging geometric motion priors to compensate for appearance-driven memory attention failures.
Bidirectional SAM2 Prompting. We process each keyframe segment (𝑘0, 𝑘1) independently. Naively…
Figure 5
Overview of the proposed multi-task learning framework. (1. Instrument Segmentation Baseline) A Mask2Former-based RPN is trained on ground-truth and pseudo masks (weighted by 𝑤) to produce segment embeddings for mask prediction and class scoring. (2. All-task Finetuning) The frozen RPN provides precomputed region embeddings. An MViT backbone with task-dedicated CLS tokens extracts spatio-temporal features …
Figure 6
Qualitative comparison of mask propagation results on the DAVIS 2017 validation set under the sparse ground-truth interpolation protocol (ground truth provided every 30 frames). FAROS maintains temporally consistent segmentation across diverse challenging scenarios including fast-moving objects, partial occlusion, and large inter-frame appearance changes, while SAM2 baseline propagation exhibits mask drift…
Figure 7
Visualization of Instrument Segmentation Results for Comparison on the GraSP Dataset.
Figure 8
Visualization of Instrument Segmentation Results for Comparison on the MISAW Dataset.
… also recovers and surpasses the single-task baseline across all metrics, further validating that flow-guided label interpolation effectively resolves annotation imbalance and enables balanced cross-task optimization. Qualitative results in …
Figure 9
Visualization of Instrument Segmentation Results for Comparison on the AutoLaparo Dataset.
Figure 10
Co-occurrence matrix between instrument and step in the GraSP. Each cell indicates the proportion of frames in which a given instrument appears during the corresponding step, normalized per instrument row. The Large Needle Driver exhibits strong co-occurrence with multiple suture-related step categories, while the Clip Applier is predominantly associated with the Clip Pedicles step.
Figure 11
Per-class step recognition AP change (ΔAP = w/ inter − w/o inter) (left) and instrument segmentation IoU change (ΔIoU = w/ inter − w/o inter) (right) on GraSP.
… keep pace with the surgical video stream. On the workflow recognition tasks, our framework attains a mean per-clip latency of 22.72 ± 5.49 ms, corresponding to a throughput of 44.02 clips/s at 32-bit precision on a single GPU, comfortably exceeding t…
Figure 12
Scatter plot of instrument ΔIoU versus linked step ΔAP on GraSP. Each bubble represents an instrument–step pair with strong co-occurrence, and bubble size is proportional to the magnitude of instrument IoU change.
… this, we propose FAROS, a flow-guided label interpolation framework that combines promptable segmentation-based mask propagation with optical flow estimation to generate temporally consistent d…
Figure 13
Co-occurrence matrix between instrument and step categories in the MISAW. The Needle instrument shows strong association with the Suture Making and Needle Holding categories, reflecting its functional role in suturing-phase procedures.
Figure 15
Scatter plot of instrument ΔIoU versus linked step ΔAP on MISAW. Bubble size reflects the magnitude of instrument IoU change. The positive correlation confirms that instruments benefiting most from interpolation propagate performance gains to their semantically associated step categories, consistent with findings on GraSP.
Figure 16
Qualitative comparison of mask propagation on the DAVIS 2017 drone subset (sparse-GT, interval = 30). Blue and red masks denote the person and drone objects, respectively. As the drone enters the scene, SAM2 baseline fails to detect and track it, producing missing masks throughout the re-entry sequence. FAROS successfully recovers accurate segmentation of both objects via flow-guided reprompting, maintain…
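
The per-instrument row normalization described in the Figure 10 and Figure 13 captions can be sketched as follows. This is an illustrative reading of the caption rather than the authors' code: the function name `row_normalized_cooccurrence` is hypothetical, the input format (one step label plus a list of visible instruments per frame) is an assumption, and the toy frames, including the step name "Suture", are made up.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def row_normalized_cooccurrence(
    frames: Iterable[Tuple[str, List[str]]]
) -> Dict[str, Dict[str, float]]:
    """Return, per instrument, the proportion of its frames spent in each step.

    Each row (instrument) sums to 1, matching "normalized per instrument row".
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for step, instruments in frames:
        for instrument in set(instruments):  # count an instrument once per frame
            counts[instrument][step] += 1
    table: Dict[str, Dict[str, float]] = {}
    for instrument, row in counts.items():
        total = sum(row.values())
        table[instrument] = {step: n / total for step, n in row.items()}
    return table


# Made-up frames: (step label, instruments visible in that frame).
frames = [
    ("Clip Pedicles", ["Clip Applier"]),
    ("Clip Pedicles", ["Clip Applier", "Large Needle Driver"]),
    ("Suture", ["Large Needle Driver"]),
    ("Suture", ["Large Needle Driver"]),
    ("Suture", ["Large Needle Driver"]),
]
table = row_normalized_cooccurrence(frames)
print(table["Large Needle Driver"])
```

Normalizing per row rather than per column answers the question "when this instrument is visible, which step is it usually in", which is the reading the captions use when linking instruments to steps.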
Original abstract

Effective multi-task learning for surgical scene understanding is fundamentally hindered by annotation granularity mismatch; temporal workflow tasks such as phase recognition, step recognition and anticipation benefit from dense frame-level supervision, whereas pixel-level spatial tasks including instrument segmentation and action recognition are only sparsely annotated on selected keyframes due to prohibitive labeling costs. This supervision imbalance undermines shared representation learning and limits joint optimization across heterogeneous surgical tasks. To address this, we propose Flow-guided Annotation for Robust Operating Scenes (FAROS), a flow-guided label interpolation framework, that combines zero-shot segmentation-based mask propagation with optical flow estimation to overcome the limitations of appearance-based propagation under challenging surgical conditions such as occlusion, smoke, and motion blur, generating temporally consistent dense pseudo labels from sparse keyframe annotations. The densified instrument masks and action labels are integrated into a unified Transformer-based multi-task framework that jointly learns surgical phase recognition, step recognition, anticipation, instrument segmentation, and action recognition, enabling balanced optimization between dense temporal supervision and sparse spatial supervision. The label interpolation quality of FAROS is first validated on the DAVIS 2017 benchmark under a sparse ground-truth protocol, confirming robust propagation beyond the surgical domain. Extensive experiments on GraSP, MISAW, and AutoLaparo benchmarks further demonstrate that FAROS significantly improves cross-task representation learning and enhances holistic surgical scene understanding performance across spatio-temporal tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes FAROS (Flow-guided Annotation for Robust Operating Scenes), a framework that combines zero-shot segmentation-based mask propagation with optical flow estimation to generate temporally consistent dense pseudo labels from sparse keyframe annotations. These densified labels are integrated into a unified Transformer-based multi-task framework for joint learning of phase recognition, step recognition, anticipation, instrument segmentation, and action recognition. The interpolation quality is validated on DAVIS 2017 under a sparse ground-truth protocol, with claims of significant performance improvements on the GraSP, MISAW, and AutoLaparo surgical benchmarks under challenging conditions.

Significance. If the central claims hold, the work addresses a practical annotation imbalance in surgical scene understanding and could improve cross-task representation learning by providing dense supervision for temporal tasks while leveraging sparse spatial annotations. The cross-domain validation on DAVIS 2017 and the focus on robustness to occlusion, smoke, and motion blur represent potential strengths for generalizability.

major comments (3)
  1. Abstract: The abstract asserts that FAROS 'significantly improves cross-task representation learning and enhances holistic surgical scene understanding performance across spatio-temporal tasks' on GraSP, MISAW, and AutoLaparo, yet provides no quantitative results, error bars, ablation details, or specific metrics to support these claims, preventing any assessment of the magnitude or reliability of the reported gains.
  2. Abstract: The core claim that the zero-shot segmentation + optical flow combination produces accurate, temporally consistent dense pseudo labels specifically under surgical challenges (occlusion, smoke, motion blur) is not supported by direct label-quality metrics or ablations on the target surgical datasets; validation is restricted to DAVIS 2017, leaving open the possibility that any end-task gains arise from the unified Transformer or joint optimization rather than the interpolation mechanism.
  3. Abstract and methods description: No details are given on the unified Transformer architecture, the joint loss formulation, training protocol, or ablation studies isolating the contribution of the FAROS-generated labels versus baseline multi-task learning, which are load-bearing for attributing improvements to the proposed interpolation.
minor comments (1)
  1. The expansion of the FAROS acronym appears only after its first use; including it in the title or abstract opening sentence would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments highlighting areas where the abstract could better support the claims. We will revise the abstract to incorporate quantitative results and clarify the validation approach. We address each major comment below.

Point-by-point responses
  1. Referee: Abstract: The abstract asserts that FAROS 'significantly improves cross-task representation learning and enhances holistic surgical scene understanding performance across spatio-temporal tasks' on GraSP, MISAW, and AutoLaparo, yet provides no quantitative results, error bars, ablation details, or specific metrics to support these claims, preventing any assessment of the magnitude or reliability of the reported gains.

    Authors: We agree that including key quantitative metrics in the abstract would strengthen the summary. In the revision, we will add specific results such as accuracy improvements on phase recognition (e.g., +X% on GraSP) and mIoU gains on instrument segmentation, along with references to ablation studies, while keeping the abstract concise. revision: yes

  2. Referee: Abstract: The core claim that the zero-shot segmentation + optical flow combination produces accurate, temporally consistent dense pseudo labels specifically under surgical challenges (occlusion, smoke, motion blur) is not supported by direct label-quality metrics or ablations on the target surgical datasets; validation is restricted to DAVIS 2017, leaving open the possibility that any end-task gains arise from the unified Transformer or joint optimization rather than the interpolation mechanism.

    Authors: The DAVIS 2017 evaluation under a sparse annotation protocol establishes the robustness of the interpolation in challenging conditions analogous to surgery. On the surgical benchmarks, the contribution of FAROS labels is isolated via ablations that compare against baseline multi-task learning without densified labels. We will revise the abstract to explicitly note that downstream task gains on GraSP/MISAW/AutoLaparo serve as the primary validation for surgical applicability. revision: partial

  3. Referee: Abstract and methods description: No details are given on the unified Transformer architecture, the joint loss formulation, training protocol, or ablation studies isolating the contribution of the FAROS-generated labels versus baseline multi-task learning, which are load-bearing for attributing improvements to the proposed interpolation.

    Authors: Section 3 details the shared Transformer encoder with task-specific decoders, the joint loss as a weighted combination of cross-entropy (temporal tasks) and segmentation losses, the training protocol (AdamW optimizer, specific schedules, augmentations), and Section 4.3 presents ablations isolating FAROS label contributions. We will update the abstract with a brief reference to these elements for clarity. revision: yes
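
The joint objective described in the third response can be sketched as below. This is a hedged illustration, not the paper's formulation: it assumes a Dice loss for the segmentation term, a single temporal task, and hypothetical parameters `w` (down-weighting pseudo masks, as the Figure 5 caption mentions a weight 𝑤) and `lambda_seg` (balancing the two terms). The actual loss choices and weights are not stated on this page.

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[float], target: int) -> float:
    """Cross-entropy for one sample given class probabilities."""
    return -math.log(probs[target])


def dice_loss(pred: Sequence[float], target: Sequence[int], eps: float = 1e-6) -> float:
    """Soft Dice loss over flattened mask probabilities (an assumed choice)."""
    inter = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * inter + eps) / (sum(pred) + sum(target) + eps)


def joint_loss(phase_probs: Sequence[float], phase_target: int,
               mask_pred: Sequence[float], mask_target: Sequence[int],
               is_pseudo: bool, w: float = 0.5, lambda_seg: float = 1.0) -> float:
    """Weighted sum of a temporal (cross-entropy) and a spatial (Dice) term.

    Masks that come from interpolation rather than annotation are scaled by w,
    so noisier pseudo labels contribute less than ground-truth keyframes.
    """
    seg = dice_loss(mask_pred, mask_target)
    if is_pseudo:
        seg *= w
    return cross_entropy(phase_probs, phase_target) + lambda_seg * seg


# The same wrong mask prediction is penalized less when the label is a pseudo label.
args = ([0.5, 0.5], 0, [0.0, 1.0], [1, 0])
print(joint_loss(*args, is_pseudo=False) > joint_loss(*args, is_pseudo=True))
```

With dense pseudo masks available, every frame contributes both terms, which is one way to read the page's claim that temporal and spatial supervision become balanced.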

Circularity Check

0 steps flagged

No circularity; framework is self-contained

full rationale

The paper introduces FAROS as a new combination of zero-shot segmentation-based mask propagation and optical flow estimation to densify sparse keyframe labels, then integrates the results into a unified Transformer multi-task model. Validation occurs on DAVIS 2017 under sparse GT protocol, with end-task metrics reported on GraSP/MISAW/AutoLaparo. No equations, fitted parameters, self-definitional reductions, or load-bearing self-citations appear in the text; the method is presented as an independent engineering solution to annotation imbalance rather than deriving from or renaming prior fitted quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Ledger constructed from abstract only; no explicit free parameters, axioms, or invented entities beyond the proposed framework name are stated.

axioms (1)
  • domain assumption: Optical flow and zero-shot segmentation remain reliable under surgical occlusions, smoke, and motion blur.
    Invoked as the reason the method succeeds where appearance-based propagation fails.
invented entities (1)
  • FAROS framework (no independent evidence)
    purpose: Flow-guided label interpolation to produce dense pseudo labels
    Newly proposed system; no independent evidence outside the paper is mentioned.

pith-pipeline@v0.9.1-grok · 5768 in / 1370 out tokens · 27748 ms · 2026-06-26T05:02:45.295945+00:00 · methodology


Reference graph

Works this paper leans on

76 extracted references · 2 canonical work pages

  1. [1]

    Deeplearningforsurgicalinstrumentrecognitionandsegmentationin robotic-assistedsurgeries:asystematicreview

    Ahmed,F.A.,Yousef,M.,Ahmed,M.A.,Ali,H.O.,Mahboob,A.,Ali, H.,Shah,Z.,Aboumarzouk,O.,AlAnsari,A.,Balakrishnan,S.,2024. Deeplearningforsurgicalinstrumentrecognitionandsegmentationin robotic-assistedsurgeries:asystematicreview. ArtificialIntelligence Review 58, 1

  2. [2]

    Multitask learning in minimallyinvasivesurgicalvision:Areview.MedicalImageAnalysis 101, 103480

    Alabi, O., Vercauteren, T., Shi, M., 2025. Multitask learning in minimallyinvasivesurgicalvision:Areview.MedicalImageAnalysis 101, 103480

  3. [3]

    2018 robotic scene segmentation challenge

    Allan, M., Kondo, S., Bodenstedt, S., Leger, S., Kadkhodamoham- madi, R., Luengo, I., Fuentes, F., Flouty, E., Mohammed, A., Ped- ersen, M., et al., 2020. 2018 robotic scene segmentation challenge. arXiv preprint arXiv:2001.11190

  4. [4]

    Themedicalsegmentationdecathlon

    Antonelli, M., Reinke, A., Bakas, S., Farahani, K., Kopp-Schneider, A., Landman, B.A., Litjens, G., Menze, B., Ronneberger, O., Sum- mers,R.M.,etal.,2022. Themedicalsegmentationdecathlon. Nature communications 13, 4128

  5. [5]

    Pixel-wise recognition for holistic surgical scene under- standing

    Ayobi, N., Rodríguez, S., Pérez, A., Hernández, I., Aparicio, N., Dessevres, E., Peña, S., Santander, J., Caicedo, J.I., Fernández, N., et al., 2024. Pixel-wise recognition for holistic surgical scene under- standing. arXiv preprint arXiv:2401.11174

  6. [6]

    Baghbaderani, R.K., Li, Y., Wang, S., Qi, H., 2024. Temporally- consistentvideosemanticsegmentationwithbidirectionalocclusion- guidedfeaturepropagation,in:ProceedingsoftheIEEE/CVFWinter Conference on Applications of Computer Vision, pp. 685–695

  7. [7]

    Semi- supervised learning for network-based cardiac mr image segmenta- tion, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer

    Bai, W., Oktay, O., Sinclair, M., Suzuki, H., Rajchl, M., Tarroni, G., Glocker, B., King, A., Matthews, P.M., Rueckert, D., 2017. Semi- supervised learning for network-based cardiac mr image segmenta- tion, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 253–260

  8. [8]

    Multitask learning

    Caruana, R., 1997. Multitask learning. Machine learning 28, 41–75

  9. [9]

    Scientific Reports 12, 19721

    Chen,Q.,Poullis,C.,2022.Motionestimationforlargedisplacements and deformations. Scientific Reports 12, 19721

  10. [10]

    Masked-attention mask transformer for universal image segmenta- tion,in:ProceedingsoftheIEEE/CVFconferenceoncomputervision and pattern recognition, pp

    Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R., 2022. Masked-attention mask transformer for universal image segmenta- tion,in:ProceedingsoftheIEEE/CVFconferenceoncomputervision and pattern recognition, pp. 1290–1299

  11. [11]

    Xmem: Long-term video object segmentation with an atkinson-shiffrin memory model, in: European conference on computer vision, Springer

    Cheng, H.K., Schwing, A.G., 2022. Xmem: Long-term video object segmentation with an atkinson-shiffrin memory model, in: European conference on computer vision, Springer. pp. 640–658

  12. [12]

    Rethinking space-time networks with improved memory coverage for efficient video object segmentation

    Cheng, H.K., Tai, Y.W., Tang, C.K., 2021. Rethinking space-time networks with improved memory coverage for efficient video object segmentation. Advancesinneuralinformationprocessingsystems34, 11781–11794

  13. [13]

    Isrobotic-assistedsurgerybetter? AMA Journal of Ethics 25, 598–604

    Chuchulo,A.,Ali,A.,2023. Isrobotic-assistedsurgerybetter? AMA Journal of Ethics 25, 598–604

  14. [14]

    Multi-tasklearningwithdeepneuralnetworks: A survey

    Crawshaw,M.,2020. Multi-tasklearningwithdeepneuralnetworks: A survey. arXiv preprint arXiv:2009.09796

  15. [15]

    Tecno: Surgical phase recognition with multi-stage temporal convolutional networks, in: International conferenceonmedicalimagecomputingandcomputer-assistedinter- vention, Springer

    Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N., 2020. Tecno: Surgical phase recognition with multi-stage temporal convolutional networks, in: International conferenceonmedicalimagecomputingandcomputer-assistedinter- vention, Springer. pp. 343–352

  16. [16]

    Deep learning in surgical workflow analysis: a review of phase and step recognition

    Demir, K.C., Schieber, H., Weise, T., Roth, D., May, M., Maier, A., Yang, S.H., 2023. Deep learning in surgical workflow analysis: a review of phase and step recognition. IEEE Journal of Biomedical and Health Informatics 27, 5405–5417. Page 15 of 17 Temporally Consistent Label Interpolation for Robust Surgical Multi-Task Learning under Challenging Conditions

  17. [17]

    Thepascalvisualobjectclasseschallenge:A retrospective

    Everingham, M., Eslami, S.A., Van Gool, L., Williams, C.K., Winn, J.,Zisserman,A.,2015. Thepascalvisualobjectclasseschallenge:A retrospective. International journal of computer vision 111, 98–136

  18. [18]

    Proceed- ings of the IEEE International Conference on Computer Vision, 99 92–10002 (2021) https://doi.org/10.1109/ICCV48922.2021.00986

    Fan,H.,Xiong,B.,Mangalam,K.,Li,Y.,Yan,Z.,Malik,J.,Feichten- hofer,C.,2021.Multiscalevisiontransformers,in:IEEEInternational Conference on Computer Vision. doi:10.1109/ICCV48922.2021.00675

  19. [19]

    Gao, X., Jin, Y., Long, Y., Dou, Q., Heng, P.A., 2021. Trans-svnet: Accurate phase recognition from surgical videos via hybrid embed- dingaggregationtransformer,in:Internationalconferenceonmedical image computing and computer-assisted intervention, Springer. pp. 593–603

  20. [20]

    In: 2018 IEEE/CVF 18 R

    Gu, C., Sun, C., Vijayanarasimhan, S., Pantofaru, C., Ross, D.A., Toderici, G., Li, Y., Ricco, S., Sukthankar, R., Schmid, C., Malik, J., 2017. Ava: A video dataset of spatio-temporally localized atomic visual actions, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. doi:10.1109/CVPR.2018.00633

  21. [21]

    Role of robotic- assisted surgery in public health: its advantages and challenges

    Handa, A., Gaidhane, A., Choudhari, S.G., 2024. Role of robotic- assisted surgery in public health: its advantages and challenges. Cureus 16

  22. [22]

    Micro-surgical anastomose workflow recognition challenge report

    Huaulmé, A., Sarikaya, D., Le Mut, K., Despinoy, F., Long, Y., Dou, Q., Chng, C.B., Lin, W., Kondo, S., Bravo-Sánchez, L., et al., 2021. Micro-surgical anastomose workflow recognition challenge report. Computer Methods and Programs in Biomedicine 212, 106452

  23. [23]

    Microsurgical instru- ment segmentation for robot-assisted surgery

    Jeong, T.K., Kim, G., Park, J., 2025. Microsurgical instru- ment segmentation for robot-assisted surgery. arXiv preprint arXiv:2509.11727

  24. [24]

    Jin,Y.,Cheng,K.,Dou,Q.,Heng,P.A.,2019. Incorporatingtemporal prior from motion flow for instrument segmentation in minimally invasivesurgeryvideo,in:Internationalconferenceonmedicalimage computing and computer-assisted intervention, Springer. pp. 440– 448

  25. [25]

    Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network

    Jin,Y.,Dou,Q.,Chen,H.,Yu,L.,Qin,J.,Fu,C.W.,Heng,P.A.,2017. Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37, 1114–1126

  26. [26]

    Multi-task recurrent convolutional network with correlation loss for surgical video analysis

    Jin,Y.,Li,H.,Dou,Q.,Chen,H.,Qin,J.,Fu,C.W.,Heng,P.A.,2020. Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572

  27. [27]

    Segment anything, in: Proceedings of the IEEE/CVF international conference on computer vision, pp

    Kirillov,A.,Mintun,E.,Ravi,N.,Mao,H.,Rolland,C.,Gustafson,L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al., 2023. Segment anything, in: Proceedings of the IEEE/CVF international conference on computer vision, pp. 4015–4026

  28. [28]

    Concurrentsegmentationandlocaliza- tion for tracking of surgical instruments, in: International conference on medical image computing and computer-assisted intervention, Springer

    Laina, I., Rieke, N., Rupprecht, C., Vizcaíno, J.P., Eslami, A., Tombari,F.,Navab,N.,2017. Concurrentsegmentationandlocaliza- tion for tracking of surgical instruments, in: International conference on medical image computing and computer-assisted intervention, Springer. pp. 664–672

  29. [29]

    Surgical process modelling: a review

    Lalys, F., Jannin, P., 2014. Surgical process modelling: a review. International journal of computer assisted radiology and surgery 9, 495–511

  30. [30]

    Pseudo-label: The simple and efficient semi- supervised learning method for deep neural networks, in: Workshop on challenges in representation learning, ICML, Atlanta

    Lee, D.H., et al., 2013. Pseudo-label: The simple and efficient semi- supervised learning method for deep neural networks, in: Workshop on challenges in representation learning, ICML, Atlanta. p. 896

  31. [31]

    Recurrent dynamicembeddingforvideoobjectsegmentation,in:Proceedingsof theIEEE/CVFConferenceonComputerVisionandPatternRecogni- tion, pp

    Li,M.,Hu,L.,Xiong,Z.,Zhang,B.,Pan,P.,Liu,D.,2022. Recurrent dynamicembeddingforvideoobjectsegmentation,in:Proceedingsof theIEEE/CVFConferenceonComputerVisionandPatternRecogni- tion, pp. 1332–1341

  32. [32]

    Drift robust non-rigid optical flowenhancementforlongsequences

    Li, W., Cosker, D., Brown, M., 2016. Drift robust non-rigid optical flowenhancementforlongsequences. JournalofIntelligent&Fuzzy Systems 31, 2583–2595

  33. [33]

    Deep learning for surgical workflow analysis: a survey of progresses, limitations, and trends

    Li, Y., Zhao, Z., Li, R., Li, F., 2024. Deep learning for surgical workflow analysis: a survey of progresses, limitations, and trends. Artificial Intelligence Review 57, 291

  34. [34]

    Surgical sam 2: Real-time segment anything in surgical video by efficient frame pruning

    Liu, H., Zhang, E., Wu, J., Hong, M., Jin, Y., 2024. Surgical sam 2: Real-time segment anything in surgical video by efficient frame pruning. arXiv preprint arXiv:2408.07931

  35. [35]

    Liu,Z.,Lin,Y.,Cao,Y.,Hu,H.,Wei,Y.,Zhang,Z.,Lin,S.,Guo,B.,

  36. [36]

    10012–10022

    Swintransformer:Hierarchicalvisiontransformerusingshifted windows, in: Proceedings of the IEEE/CVF international conference on computer vision, pp. 10012–10022

  37. [37]

    Semi-supervised medical image segmentation via uncertainty rectified pyramid consistency

    Luo, X., Wang, G., Liao, W., Chen, J., Song, T., Chen, Y., Zhang, S., Metaxas, D.N., Zhang, S., 2022. Semi-supervised medical image segmentation via uncertainty rectified pyramid consistency. Medical Image Analysis 80, 102517

  38. [38]

    Maier-Hein,L.,Vedula,S.S.,Speidel,S.,Navab,N.,Kikinis,R.,Park, A., Eisenmann, M., Feussner, H., Forestier, G., Giannarou, S., et al.,

  39. [39]

    Nature Biomedical Engineering 1, 691–696

    Surgicaldatasciencefornext-generationinterventions. Nature Biomedical Engineering 1, 691–696

  40. [40]

    Robotic surgery: applications, limitations, and impact on surgical education

    Morris, B., 2005. Robotic surgery: applications, limitations, and impact on surgical education. Medscape General Medicine 7, 72

  41. [41]

    Joint-task regulariza- tion for partially labeled multi-task learning, in: Proceedings of the IEEE/CVFConferenceonComputerVisionandPatternRecognition, pp

    Nishi, K., Kim, J., Li, W., Pfister, H., 2024. Joint-task regulariza- tion for partially labeled multi-task learning, in: Proceedings of the IEEE/CVFConferenceonComputerVisionandPatternRecognition, pp. 16152–16162

  42. [42]

    Nwoye, C.I., Gonzalez, C., Yu, T., Mascagni, P., Mutter, D., Marescaux, J., Padoy, N., 2020. Recognition of instrument-tissue interactions in endoscopic videos via action triplets, in: International conferenceonmedicalimagecomputingandcomputer-assistedinter- vention, Springer. pp. 364–374

  43. [43]

    Video object seg- mentationusingspace-timememorynetworks,in:Proceedingsofthe IEEE/CVF international conference on computer vision, pp

    Oh, S.W., Lee, J.Y., Xu, N., Kim, S.J., 2019. Video object seg- mentationusingspace-timememorynetworks,in:Proceedingsofthe IEEE/CVF international conference on computer vision, pp. 9226– 9235

  44. [44]

    The 2017 davis challenge on video object segmentation

    Pont-Tuset, J., Perazzi, F., Caelles, S., Arbeláez, P., Sorkine-Hornung, A., Van Gool, L., 2017. The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675

  45. [45]

    Robust instance tracking via uncertainty flow. arXiv preprint arXiv:2010.04367

    Qian, J., Nan, J., Ancha, S., Okorn, B., Held, D., 2020. Robust instance tracking via uncertainty flow. arXiv preprint arXiv:2010.04367

  46. [46]

    Weakly supervised temporal convolutional networks for fine-grained surgical activity recognition

    Ramesh, S., Dall’Alba, D., Gonzalez, C., Yu, T., Mascagni, P., Mutter, D., Marescaux, J., Fiorini, P., Padoy, N., 2023. Weakly supervised temporal convolutional networks for fine-grained surgical activity recognition. IEEE Transactions on Medical Imaging 42, 2592–2602

  47. [47]

    Sam 2: Segment anything in images and videos, in: International Conference on Learning Representations

    Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al., 2025. Sam 2: Segment anything in images and videos, in: International Conference on Learning Representations, pp. 28085–28128

  48. [48]

    Unsupervised learning of optical flow with patch consistency and occlusion estimation

    Ren, Z., Yan, J., Yang, X., Yuille, A., Zha, H., 2020. Unsupervised learning of optical flow with patch consistency and occlusion estimation. Pattern Recognition 103, 107191

  49. [49]

    Rivoir, D., Bodenstedt, S., Funke, I., von Bechtolsheim, F., Distler, M., Weitz, J., Speidel, S., 2020. Rethinking anticipation tasks: Uncertainty-aware anticipation of sparse surgical instrument usage for context-aware assistance, in: International conference on medical image computing and computer-assisted intervention, Springer. pp. 752–762

  50. [50]

    U-net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical image computing and computer-assisted intervention, Springer

    Ronneberger, O., Fischer, P., Brox, T., 2015. U-net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical image computing and computer-assisted intervention, Springer. pp. 234–241

  51. [51]

    An overview of multi-task learning in deep neural networks

    Ruder, S., 2017. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098

  52. [52]

    Evaluation of extra pixel interpolation with mask processing for medical image segmentation with deep learning

    Rukundo, O., 2024. Evaluation of extra pixel interpolation with mask processing for medical image segmentation with deep learning. Signal, Image and Video Processing 18, 7703–7710

  53. [53]

    Robotic surgery

    Schreuder, H., Verheijen, R., 2009. Robotic surgery. BJOG: An International Journal of Obstetrics & Gynaecology 116, 198–213

  54. [54]

    Fun-sis: A fully unsupervised approach for surgical instrument segmentation

    Sestini, L., Rosa, B., De Momi, E., Ferrigno, G., Padoy, N., 2023. Fun-sis: A fully unsupervised approach for surgical instrument segmentation. Medical Image Analysis 85, 102751

  55. [55]

    Hierarchical image saliency detection on extended cssd

    Shi, J., Yan, Q., Xu, L., Jia, J., 2015. Hierarchical image saliency detection on extended cssd. IEEE transactions on pattern analysis and machine intelligence 38, 717–729

  56. [56]

    Semi-supervised learning with progressive unlabeled data excavation for label-efficient surgical workflow recognition

    Shi, X., Jin, Y., Dou, Q., Heng, P.A., 2021. Semi-supervised learning with progressive unlabeled data excavation for label-efficient surgical workflow recognition. Medical Image Analysis 73, 102158

  57. [57]

    Automatic instrument segmentation in robot-assisted surgery using deep learning, in: 2018 17th IEEE international conference on machine learning and applications (ICMLA), IEEE

    Shvets, A.A., Rakhlin, A., Kalinin, A.A., Iglovikov, V.I., 2018. Automatic instrument segmentation in robot-assisted surgery using deep learning, in: 2018 17th IEEE international conference on machine learning and applications (ICMLA), IEEE. pp. 624–628

  58. [58]

    Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results

    Tarvainen, A., Valpola, H., 2017. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in neural information processing systems 30

  59. [59]

    Raft: Recurrent all-pairs field transforms for optical flow, in: European conference on computer vision, Springer

    Teed, Z., Deng, J., 2020. Raft: Recurrent all-pairs field transforms for optical flow, in: European conference on computer vision, Springer. pp. 402–419

  60. [60]

    Is learning the n-th thing any easier than learning the first? Advances in neural information processing systems 8

    Thrun, S., 1995. Is learning the n-th thing any easier than learning the first? Advances in neural information processing systems 8

  61. [61]

    Endonet: A deep architecture for recognition tasks on laparoscopic videos

    Twinanda, A.P., Shehata, S., Mutter, D., Marescaux, J., De Mathelin, M., Padoy, N., 2016. Endonet: A deep architecture for recognition tasks on laparoscopic videos. IEEE transactions on medical imaging 36, 86–97

  62. [62]

    Towards holistic surgical scene understanding, in: International conference on medical image computing and computer-assisted intervention, Springer

    Valderrama, N., Ruiz Puentes, P., Hernández, I., Ayobi, N., Verlyck, M., Santander, J., Caicedo, J., Fernández, N., Arbeláez, P., 2022. Towards holistic surgical scene understanding, in: International conference on medical image computing and computer-assisted intervention, Springer. pp. 442–452

  63. [63]

    Look before you match: Instance understanding matters in video object segmentation, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Wang, J., Chen, D., Wu, Z., Luo, C., Tang, C., Dai, X., Zhao, Y., Xie, Y., Yuan, L., Jiang, Y.G., 2023. Look before you match: Instance understanding matters in video object segmentation, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2268–2278

  64. [64]

    Wang, Z., Lu, B., Long, Y., Zhong, F., Cheung, T.H., Dou, Q., Liu, Y., 2022. Autolaparo: A new dataset of integrated multi-tasks for image-guided surgical automation in laparoscopic hysterectomy, in: Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 486–496

  65. [65]

    Segmatch: semi-supervised surgical instrument segmentation

    Wei, M., Budd, C., Garcia-Peraza-Herrera, L.C., Dorent, R., Shi, M., Vercauteren, T., 2025. Segmatch: semi-supervised surgical instrument segmentation. Scientific Reports 15, 14042

  66. [66]

    Accflow: Backward accumulation for long-range optical flow, in: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Wu, G., Liu, X., Luo, K., Liu, X., Zheng, Q., Liu, S., Jiang, X., Zhai, G., Wang, W., 2023. Accflow: Backward accumulation for long-range optical flow, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12119–12128

  67. [67]

    Appearance-based refinement for object-centric motion segmentation, in: European Conference on Computer Vision, Springer

    Xie, J., Xie, W., Zisserman, A., 2024. Appearance-based refinement for object-centric motion segmentation, in: European Conference on Computer Vision, Springer. pp. 238–256

  68. [68]

    Gmflow: Learning optical flow via global matching, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Xu, H., Zhang, J., Cai, J., Rezatofighi, H., Tao, D., 2022. Gmflow: Learning optical flow via global matching, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8121–8130

  69. [69]

    Hard frame detection and online mapping for surgical phase recognition, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer

    Yi, F., Jiang, T., 2019. Hard frame detection and online mapping for surgical phase recognition, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 449–457

  70. [70]

    Memory-augmented sam2 for training-free surgical video segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer

    Yin, M., Wang, F., Ye, X., Meng, Y., Fu, Z., 2025. Memory-augmented sam2 for training-free surgical video segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 328–337

  71. [71]

    Yu, J., Wang, A., Dong, W., Xu, M., Islam, M., Wang, J., Bai, L., Ren, H., 2025. Sam 2 in robotic surgery: An empirical evaluation for robustness and generalization in surgical video segmentation, in: International Workshop on Efficient Medical Artificial Intelligence, Springer. pp. 174–183

  72. [72]

    Yu, Y., Zhao, Z., Jin, Y., Chen, G., Dou, Q., Heng, P.A., 2022. Pseudo-label guided cross-video pixel contrast for robotic surgical scene segmentation with limited annotations, in: 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE. pp. 10857–10864

  73. [73]

    Semisam+: rethinking semi-supervised medical image segmentation in the era of foundation models

    Zhang, Y., Lv, B., Xue, L., Zhang, W., Liu, Y., Fu, Y., Cheng, Y., Qi, Y., 2025. Semisam+: rethinking semi-supervised medical image segmentation in the era of foundation models. Medical Image Analysis, 103733

  74. [74]

    Nasalseg: A dataset for automatic segmentation of nasal cavity and paranasal sinuses from 3d ct images

    Zhang, Y., Wang, J., Pan, T., Jiang, Q., Ge, J., Guo, X., Jiang, C., Lu, J., Zhang, J., Liu, X., et al., 2024. Nasalseg: A dataset for automatic segmentation of nasal cavity and paranasal sinuses from 3d ct images. Scientific Data 11, 1329

  75. [75]

    A survey on multi-task learning

    Zhang, Y., Yang, Q., 2021. A survey on multi-task learning. IEEE transactions on knowledge and data engineering 34, 5586–5609

  76. [76]

    Zhao, Z., Jin, Y., Gao, X., Dou, Q., Heng, P.A., 2020. Learning motion flows for semi-supervised instrument segmentation from robotic surgical video, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 679–689