arxiv: 1907.11454 · v1 · pith:UNVMOGOFnew · submitted 2019-07-26 · 💻 cs.CV · cs.LG

Using 3D Convolutional Neural Networks to Learn Spatiotemporal Features for Automatic Surgical Gesture Recognition in Video

Pith reviewed 2026-05-24 15:57 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords surgical gesture recognition · 3D CNN · spatiotemporal features · JIGSAWS dataset · laparoscopic video · robot-assisted surgery · video classification

The pith

A 3D CNN learns joint spatiotemporal features from video frames to recognize surgical gestures at over 84 percent frame-wise accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes training a 3D convolutional neural network on sequences of laparoscopic video frames to extract features that combine spatial appearance and temporal motion in one step. This is tested on the JIGSAWS recordings of robot-assisted suturing tasks. The resulting model reaches frame-by-frame gesture classification accuracy above 84 percent and beats both purely spatial networks and networks that handle space and time in separate stages. A reader would care because the method works from ordinary video alone, without extra sensors, opening routes to automatic skill evaluation and step monitoring during surgery.
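To make "sequences of video frames" concrete, here is a minimal sketch of how a recording could be cut into overlapping clips of consecutive frames for a 3D CNN. The function name `make_clips`, the clip length, the stride, and the convention of labeling each clip with the gesture of its last frame are all assumptions made for illustration; this summary does not state the settings the paper uses.

```python
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def make_clips(
    frames: Sequence[Frame],
    labels: Sequence[str],
    clip_len: int = 16,  # hypothetical value, not taken from the paper
    stride: int = 1,     # hypothetical value, not taken from the paper
) -> List[Tuple[List[Frame], str]]:
    """Cut a video into overlapping clips of consecutive frames.

    Each clip is paired with the gesture label of its last frame
    (an assumed convention for this sketch).
    """
    if len(frames) != len(labels):
        raise ValueError("need exactly one label per frame")
    if clip_len < 1 or stride < 1:
        raise ValueError("clip_len and stride must be positive")

    clips = []
    for start in range(0, len(frames) - clip_len + 1, stride):
        end = start + clip_len
        clips.append((list(frames[start:end]), labels[end - 1]))
    return clips


# Toy example: integers stand in for image frames, labels are placeholders.
toy_frames = list(range(6))
toy_labels = ["G1", "G1", "G1", "G2", "G2", "G2"]
toy_clips = make_clips(toy_frames, toy_labels, clip_len=3, stride=1)
```

With six frames and a clip length of three, the toy call yields four clips; the second clip already carries the label of the following gesture because its last frame falls after the transition.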

Core claim

The central claim is that a 3D CNN trained directly on consecutive video frames can learn spatiotemporal features for surgical gesture recognition, producing frame-wise accuracies of more than 84 percent on the JIGSAWS robot-assisted suturing recordings and outperforming models that extract only spatial features or treat spatial and low-level temporal information separately.
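The metric behind the headline number can be sketched in a few lines. The code below assumes that frame-wise accuracy means the fraction of frames whose predicted gesture label equals the annotated one; the function name and the toy labels are illustrative and do not come from the paper.

```python
from typing import Sequence


def frame_wise_accuracy(predicted: Sequence[str], truth: Sequence[str]) -> float:
    """Fraction of frames whose predicted gesture matches the annotation."""
    if len(predicted) != len(truth):
        raise ValueError("prediction and annotation must cover the same frames")
    if not truth:
        raise ValueError("cannot score an empty video")
    correct = sum(p == t for p, t in zip(predicted, truth))
    return correct / len(truth)


# Toy example with made-up labels: 4 of 5 frames agree.
toy_truth = ["G1", "G1", "G2", "G2", "G3"]
toy_pred = ["G1", "G1", "G2", "G3", "G3"]
toy_accuracy = frame_wise_accuracy(toy_pred, toy_truth)
```

Because every frame counts equally, long gestures dominate this score, which is one reason a single accuracy figure may hide errors around gesture transitions.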

What carries the argument

A 3D Convolutional Neural Network that takes stacks of consecutive video frames as input and applies learned filters that span space and time jointly across the volume.
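What a filter spanning space and time does can be shown with a naive, single-channel 3D cross-correlation. This is a sketch for intuition only: the helper `conv3d_valid`, the toy clip, and the hand-written kernel are assumptions, whereas a real 3D CNN learns many such kernels over multiple channels.

```python
from typing import List

Volume = List[List[List[float]]]  # indexed as [time][row][column]


def conv3d_valid(volume: Volume, kernel: Volume) -> Volume:
    """Naive single-channel 3D cross-correlation without padding."""
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                # Sum over the kernel's temporal and spatial extent at once.
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += kernel[dt][dy][dx] * volume[t + dt][y + dy][x + dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Toy clip: a bright pixel moves one column to the right per frame.
moving_clip = [
    [[1.0, 0.0, 0.0]],
    [[0.0, 1.0, 0.0]],
    [[0.0, 0.0, 1.0]],
]
# Same pixel, but it stays in the first column in every frame.
static_clip = [[[1.0, 0.0, 0.0]] for _ in range(3)]

# Kernel spanning 2 frames and 2 columns: strongest for rightward motion.
motion_kernel = [
    [[1.0, 0.0]],
    [[0.0, 1.0]],
]
moving_response = conv3d_valid(moving_clip, motion_kernel)
static_response = conv3d_valid(static_clip, motion_kernel)
```

On the moving clip the kernel reaches its peak response of 2.0 along the pixel's trajectory, while the stationary pixel never produces more than 1.0. Appearance and motion are thus combined in a single filter response, which is the property the paper's argument relies on.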

If this is right

  • Video-only gesture recognition becomes feasible at low cost in any operating room equipped with a standard laparoscope.
  • Automatic skill assessment and intra-operative alerts for critical steps can be built without attaching extra tracking hardware to the instruments.
  • The same architecture can be retrained on other recorded procedures once labeled gesture data for those tasks exists.
  • Real-time deployment on surgical video streams becomes practical if the 3D network is optimized for inference speed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending the input stack length or adding recurrent layers on top of the 3D features might capture longer-range dependencies that single 3D convolutions miss.
  • The approach could be tested on full-length procedures rather than short bench-top suturing clips to check whether performance holds when gesture transitions are rarer.
  • Combining the learned spatiotemporal features with kinematic data from the robot, when available, would likely raise accuracy further but would lose the pure-video advantage.

Load-bearing premise

The reported accuracy advantage is caused by the joint spatiotemporal modeling inside the 3D CNN rather than by differences in network size, training schedule, or preprocessing steps.

What would settle it

An ablation study that trains a 2D CNN baseline and a 3D CNN with matched network capacity, identical data splits, and identical optimization settings, then measures whether the 3D version still shows a clear accuracy gain on the same suturing videos.
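Why capacity needs to be controlled is easy to see from parameter counts. The sketch below uses the standard formulas for the number of weights in a 2D and a 3D convolutional layer; the channel counts and kernel sizes are hypothetical and are not taken from the paper's architectures.

```python
def conv2d_params(c_in: int, c_out: int, k: int, bias: bool = True) -> int:
    """Weights in a 2D conv layer with a k x k kernel."""
    return c_in * c_out * k * k + (c_out if bias else 0)


def conv3d_params(c_in: int, c_out: int, k: int, k_t: int, bias: bool = True) -> int:
    """Weights in a 3D conv layer with a k_t x k x k kernel."""
    return c_in * c_out * k_t * k * k + (c_out if bias else 0)


# Hypothetical layer: 64 input channels, 64 output channels, 3 x 3 kernels,
# and a temporal extent of 3 frames for the 3D variant.
params_2d = conv2d_params(64, 64, 3)
params_3d = conv3d_params(64, 64, 3, 3)
ratio = params_3d / params_2d
```

For these sizes the 3D layer has roughly three times as many weights as its 2D counterpart, so a naive layer-for-layer swap changes capacity as well as the type of features. A matched ablation would have to adjust channel widths or depth to compensate.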

Figures

Figures reproduced from arXiv: 1907.11454 by Felix von Bechtolsheim, Florian Oehme, Isabel Funke, Jürgen Weitz, Sebastian Bodenstedt, Stefanie Speidel.

Figure 1: Surgical gestures defined for the JIGSAWS suturing task
Figure 2: Qualitative results. We depict the surgical gesture estimates obtained by ...
Figure 3: Evaluating the sliding window approach with varying ...
Original abstract

Automatically recognizing surgical gestures is a crucial step towards a thorough understanding of surgical skill. Possible areas of application include automatic skill assessment, intra-operative monitoring of critical surgical steps, and semi-automation of surgical tasks. Solutions that rely only on the laparoscopic video and do not require additional sensor hardware are especially attractive as they can be implemented at low cost in many scenarios. However, surgical gesture recognition based only on video is a challenging problem that requires effective means to extract both visual and temporal information from the video. Previous approaches mainly rely on frame-wise feature extractors, either handcrafted or learned, which fail to capture the dynamics in surgical video. To address this issue, we propose to use a 3D Convolutional Neural Network (CNN) to learn spatiotemporal features from consecutive video frames. We evaluate our approach on recordings of robot-assisted suturing on a bench-top model, which are taken from the publicly available JIGSAWS dataset. Our approach achieves high frame-wise surgical gesture recognition accuracies of more than 84%, outperforming comparable models that either extract only spatial features or model spatial and low-level temporal information separately. For the first time, these results demonstrate the benefit of spatiotemporal CNNs for video-based surgical gesture recognition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes using 3D Convolutional Neural Networks to learn joint spatiotemporal features directly from sequences of laparoscopic video frames for frame-wise surgical gesture recognition. It evaluates the method on robot-assisted suturing recordings from the public JIGSAWS dataset and reports frame-wise accuracies above 84%, claiming outperformance over models that extract only spatial features or handle spatial and temporal information separately.

Significance. If the performance advantage can be rigorously attributed to the 3D CNN architecture, the result would be significant for video-based surgical analysis by showing that end-to-end spatiotemporal feature learning improves gesture recognition without additional sensors. The work addresses a practical clinical problem and uses a public benchmark, but the current evidence does not yet isolate the contribution of joint spatiotemporal modeling.

major comments (2)
  1. [Abstract] Abstract: the claim that the 3D CNN 'outperforms comparable models that either extract only spatial features or model spatial and low-level temporal information separately' cannot be evaluated because no architecture specifications, parameter counts, training schedules, or preprocessing details are supplied for the baselines; without these, the accuracy difference cannot be attributed to joint spatiotemporal learning.
  2. [Abstract] Abstract / Results: accuracies are stated as 'more than 84%' with no error bars, no statistical significance tests against baselines, and no ablation studies that hold network capacity or optimization procedure fixed; this leaves the central outperformance claim only partially supported.
minor comments (1)
  1. [Abstract] Abstract: the statement 'For the first time, these results demonstrate the benefit...' should be accompanied by citations to prior spatiotemporal CNN applications in other video domains to avoid overstatement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments on our manuscript. We address the major comments point-by-point below and propose revisions where appropriate to strengthen the presentation of our results.

  1. Referee: [Abstract] Abstract: the claim that the 3D CNN 'outperforms comparable models that either extract only spatial features or model spatial and low-level temporal information separately' cannot be evaluated because no architecture specifications, parameter counts, training schedules, or preprocessing details are supplied for the baselines; without these, the accuracy difference cannot be attributed to joint spatiotemporal learning.

    Authors: The full manuscript includes detailed descriptions of the baseline architectures (2D CNN and hybrid models), their parameter counts, training schedules, and preprocessing steps in Sections 3 and 4. The abstract summarizes the key finding, but to address this concern, we will revise the abstract to explicitly state that the baselines were implemented with comparable model capacities and trained using identical procedures and data splits. This will allow readers to better evaluate the attribution to joint spatiotemporal learning. revision: yes

  2. Referee: [Abstract] Abstract / Results: accuracies are stated as 'more than 84%' with no error bars, no statistical significance tests against baselines, and no ablation studies that hold network capacity or optimization procedure fixed; this leaves the central outperformance claim only partially supported.

    Authors: We agree that including error bars, statistical tests, and capacity-controlled ablations would provide stronger support. The manuscript reports average accuracies over multiple folds, but we did not include variance measures or significance tests in the abstract or main results table. We will revise the results section to include standard deviations across runs, perform paired t-tests or similar for significance against baselines, and add an ablation study that matches network capacity and optimization settings. These changes will be incorporated in the revised manuscript. revision: yes
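The statistics promised in the second response can be sketched with the standard library alone. The code assumes one accuracy value per cross-validation fold for each model, with folds matched between models; the function name is illustrative and the per-fold numbers are placeholders invented for this example, not results from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Return the paired t statistic and degrees of freedom for a - b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need at least two matched pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Placeholder per-fold accuracies, invented for illustration only.
model_a = [0.60, 0.62, 0.64, 0.66]
model_b = [0.59, 0.60, 0.61, 0.62]

spread_a = statistics.stdev(model_a)  # the kind of error bar the referee asks for
t_stat, dof = paired_t_statistic(model_a, model_b)
```

The resulting statistic would then be compared against a t distribution with `dof` degrees of freedom. With only a handful of folds such a test has little power, so a non-parametric alternative or repeated runs may be the more cautious choice.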

Circularity Check

0 steps flagged

No circularity: empirical evaluation on public dataset with no derivations or self-referential claims


The paper is an empirical ML study proposing 3D CNNs for video-based surgical gesture recognition and reporting frame-wise accuracies >84% on the JIGSAWS dataset, outperforming spatial-only or separate spatial+temporal models. No equations, parameter fits presented as predictions, uniqueness theorems, or ansatzes are present in the provided text. The central claim rests on direct experimental comparison against baselines on an external public benchmark, with no reduction of results to self-definitions or self-citations. This is self-contained empirical work.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the standard assumption that 3D CNNs can learn useful spatiotemporal features from video when trained on the JIGSAWS dataset; network weights are learned from data.

free parameters (1)
  • 3D CNN weights and hyperparameters
    Learned during supervised training on the dataset; typical for deep learning models.
axioms (1)
  • domain assumption: 3D convolutions can jointly capture spatial and temporal information from video sequences
    Invoked as the core reason the method outperforms prior frame-wise extractors.


