
arxiv: 1907.01847 · v1 · pith:VAC4PO2Knew · submitted 2019-07-03 · 💻 cs.CV · eess.IV

Deformable Tube Network for Action Detection in Videos

Pith reviewed 2026-05-25 10:38 UTC · model grok-4.3

classification 💻 cs.CV eess.IV
keywords action detection · video analysis · deformable tubes · 3D convolution · spatio-temporal detection · proposal linking · UCF-Sports · AVA

The pith

Deformable action tubes generated by linking frame proposals outperform 3D cuboids in video action detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a two-stage detector, the Deformable Tube Network (DTN), that first generates flexible tube-shaped action proposals across video frames and then classifies them with a 3D convolutional network. This explicitly models how an action's spatial extent changes over time instead of enclosing it in a fixed 3D box. A sympathetic reader would care because better modeling of action shape could yield more accurate detection of human activities in video. The paper reports state-of-the-art results on the UCF-Sports and AVA datasets and a significant margin over cuboid-based methods.

Core claim

The Deformable Tube Network consists of a Deformation Tube Proposal Network that uses a fast proposal linking algorithm to connect region proposals across frames into multiple deformable action tube proposals, and a Deformable Tube Recognition Network that employs a 3D convolution network with skip connections to perform tube classification and regression. Modelling action proposals as deformable tubes allows explicit consideration of action tube shapes compared to 3D cuboids, and the 3D convolution network learns temporal dynamics sufficiently for action detection.
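To make the tube-versus-cuboid contrast concrete, here is a minimal Python sketch that is not taken from the paper. It assumes a tube is simply one (x1, y1, x2, y2) box per frame and a cuboid is a single box held fixed over all frames; the helper names `enclosing_cuboid` and `mean_cuboid_iou` are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), an assumed layout


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def enclosing_cuboid(tube: List[Box]) -> Box:
    """Smallest single box that contains every per-frame box of the tube."""
    return (
        min(b[0] for b in tube),
        min(b[1] for b in tube),
        max(b[2] for b in tube),
        max(b[3] for b in tube),
    )


def mean_cuboid_iou(tube: List[Box]) -> float:
    """Average overlap between the fixed cuboid and the per-frame boxes."""
    cuboid = enclosing_cuboid(tube)
    return sum(iou(cuboid, b) for b in tube) / len(tube)


# Toy data (illustrative only): a 10x10 actor drifting 5 px right per frame,
# and the same actor standing still.
moving_tube = [(5.0 * t, 0.0, 5.0 * t + 10.0, 10.0) for t in range(5)]
static_tube = [(0.0, 0.0, 10.0, 10.0)] * 5

print(mean_cuboid_iou(moving_tube))
print(mean_cuboid_iou(static_tube))
```

For the drifting actor, the tightest cuboid covers three times the area of any per-frame box, so the mean overlap falls to 1/3, while the static actor gives 1.0. A per-frame tube avoids that slack by construction, which is the intuition behind the shape argument.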

What carries the argument

Deformable action tube proposals generated by linking region proposals across frames using the fast proposal linking algorithm in the Deformation Tube Proposal Network.
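The abstract does not state FTL's linking criterion, so the following is only a generic greedy IoU-linking sketch in Python, standing in for the general idea of connecting per-frame proposals into tubes. The function name `link_proposals` and the `min_iou` threshold are hypothetical and are not the paper's.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), an assumed layout


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_proposals(frames: List[List[Box]], min_iou: float = 0.3) -> List[List[Box]]:
    """Greedy linking (an assumption, not FTL): each active tube takes the
    best-overlapping unused box in the next frame, or is closed."""
    finished: List[List[Box]] = []
    active: List[List[Box]] = []
    for boxes in frames:
        unused = list(range(len(boxes)))
        still_active: List[List[Box]] = []
        for tube in active:
            last = tube[-1]
            best = max(unused, key=lambda j: iou(last, boxes[j]), default=None)
            if best is not None and iou(last, boxes[best]) >= min_iou:
                tube.append(boxes[best])
                unused.remove(best)
                still_active.append(tube)
            else:
                finished.append(tube)  # no good match: the tube ends here
        # Boxes that no tube claimed start new tubes.
        still_active.extend([boxes[j]] for j in unused)
        active = still_active
    return finished + active


# Toy proposals (illustrative only): two actors, listed in shuffled order.
frames = [
    [(0.0, 0.0, 10.0, 10.0), (50.0, 50.0, 60.0, 60.0)],
    [(52.0, 50.0, 62.0, 60.0), (2.0, 0.0, 12.0, 10.0)],
    [(4.0, 0.0, 14.0, 10.0), (54.0, 50.0, 64.0, 60.0)],
]
tubes = link_proposals(frames)
print(len(tubes))
```

Because each tube keeps its own box per frame, the result follows the actor as it moves. A greedy pass like this is order-dependent, and a dynamic-programming linker such as Viterbi is a common alternative; which one FTL resembles is not established by the abstract.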

If this is right

  • Significantly outperforms methods using 3D cuboids for action detection.
  • Achieves state-of-the-art results on the UCF-Sports dataset.
  • Achieves state-of-the-art results on the AVA dataset.
  • 3D convolution based recognition learns temporal dynamics for better detection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If deformable tubes better capture varying shapes, similar linking methods could improve other video understanding tasks like object tracking.
  • The approach may allow detection of actions with complex motions that rigid cuboids miss.
  • Extending the fast linking algorithm to longer videos could test scalability.

Load-bearing premise

The fast proposal linking algorithm produces deformable tube proposals that accurately capture the varying shapes of actions across frames.

What would settle it

Running the detector on a new dataset with actions that change shape dramatically between frames and finding no improvement over 3D cuboid methods would challenge the claim.

Figures

Figures reproduced from arXiv: 1907.01847 by Changhu Wang, Dashan Guo, Lei Huang, Wei Li, Xiangzhong Fang, Zehuan Yuan.

Figure 1: The overall architecture of our proposed two-stage action localization model with DTPN and DTRN. We link per-frame proposals …
Figure 2: Per-category AP on AVA dataset: baseline model, baseline-multi model and our DTN. Categories are sorted by the number …
Figure 3: Visualization of five detection examples from UCF-Sports dataset. Blue boxes indicate model detections and red boxes denote …
Figure 4: Visualization examples on AVA dataset. Blue boxes indicate model predictions and red boxes denote ground truths. The ground …
Figure 5: Visualization of linked tube examples with our DTPN. The green boxes represent region proposals of linked tubes and the red …
Original abstract

We address the problem of spatio-temporal action detection in videos. Existing methods commonly either ignore temporal context in action recognition and localization, or lack the modelling of flexible shapes of action tubes. In this paper, we propose a two-stage action detector called Deformable Tube Network (DTN), which is composed of a Deformation Tube Proposal Network (DTPN) and a Deformable Tube Recognition Network (DTRN) similar to the Faster R-CNN architecture. In DTPN, a fast proposal linking algorithm (FTL) is introduced to connect region proposals across frames to generate multiple deformable action tube proposals. To perform action detection, we design a 3D convolution network with skip connections for tube classification and regression. Modelling action proposals as deformable tubes explicitly considers the shape of action tubes compared to 3D cuboids. Moreover, 3D convolution based recognition network can learn temporal dynamics sufficiently for action detection. Our experimental results show that we significantly outperform the methods with 3D cuboids and obtain the state-of-the-art results on both UCF-Sports and AVA datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes the Deformable Tube Network (DTN), a two-stage detector analogous to Faster R-CNN, consisting of a Deformation Tube Proposal Network (DTPN) that employs a fast proposal linking (FTL) algorithm to connect per-frame region proposals into deformable action tube proposals, followed by a Deformable Tube Recognition Network (DTRN) that applies 3D convolutions with skip connections for tube classification and regression. The central claim is that explicitly modeling flexible tube shapes (rather than fixed 3D cuboids) combined with sufficient temporal modeling yields significant outperformance over cuboid-based methods and state-of-the-art results on the UCF-Sports and AVA datasets.

Significance. If the empirical claims are substantiated, the work would advance spatio-temporal action detection by replacing rigid cuboid proposals with deformable tubes that better accommodate varying action shapes across frames. The combination of proposal linking with 3D-convolutional recognition is a natural extension of existing two-stage detectors and could improve localization accuracy on benchmarks where actions exhibit non-rigid motion.

major comments (1)
  1. [Abstract] Abstract: the assertion that the method 'significantly outperform[s] the methods with 3D cuboids and obtain[s] the state-of-the-art results on both UCF-Sports and AVA datasets' supplies no quantitative metrics, baseline names, dataset splits, ablation results, or error bars. Because the paper's contribution is framed entirely as an empirical improvement, the absence of these supporting data is load-bearing for the central claim.
minor comments (1)
  1. [Abstract] Abstract: the description of the FTL linking step is limited to a single sentence; a brief statement of its computational complexity or linking criterion would clarify how the deformable tubes are generated before the reader reaches the methods section.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. The single major comment concerns the abstract's lack of supporting quantitative details for the empirical claims. We address this point below and agree that a revision to the abstract is warranted.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the assertion that the method 'significantly outperform[s] the methods with 3D cuboids and obtain[s] the state-of-the-art results on both UCF-Sports and AVA datasets' supplies no quantitative metrics, baseline names, dataset splits, ablation results, or error bars. Because the paper's contribution is framed entirely as an empirical improvement, the absence of these supporting data is load-bearing for the central claim.

    Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revised version we will update the abstract to report the primary frame-mAP numbers on UCF-Sports and AVA, the main competing baselines, and the standard dataset splits used. Space constraints preclude embedding full ablation tables or per-run error bars in the abstract; those results are already presented with full detail (including standard deviations where computed) in Section 4. This targeted revision will make the central empirical claim self-contained while preserving the abstract's readability.

    Revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on experiments

The paper describes a two-stage neural architecture (DTPN with FTL linking to produce deformable tubes, followed by 3D-conv DTRN) whose central claims are empirical outperformance on UCF-Sports and AVA. No equations, first-principles derivations, or predictions appear that reduce to inputs by construction. Performance assertions are supported by reported results rather than self-referential fitting or self-citation chains. The work is self-contained as standard empirical CV research with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; 'deformable tubes' is presented as a modeling choice rather than a new postulated physical entity.

pith-pipeline@v0.9.0 · 5734 in / 1099 out tokens · 59964 ms · 2026-05-25T10:38:03.098199+00:00 · methodology
