Deformable Tube Network for Action Detection in Videos

Changhu Wang; Dashan Guo; Lei Huang; Wei Li; Xiangzhong Fang; Zehuan Yuan

arxiv: 1907.01847 · v1 · pith:VAC4PO2Knew · submitted 2019-07-03 · 💻 cs.CV · eess.IV

Wei Li, Zehuan Yuan, Dashan Guo, Lei Huang, Xiangzhong Fang, Changhu Wang

Pith reviewed 2026-05-25 10:38 UTC · model grok-4.3

classification 💻 cs.CV eess.IV

keywords: action detection, video analysis, deformable tubes, 3D convolution, spatio-temporal detection, proposal linking, UCF-Sports, AVA

The pith

Deformable action tubes generated by linking frame proposals outperform 3D cuboids in video action detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a two-stage detector called Deformable Tube Network that first generates flexible tube-shaped proposals for actions across video frames and then classifies them with a 3D convolutional network. This approach explicitly models the changing shapes of actions instead of using fixed 3D boxes. A sympathetic reader would care because better modeling of action shapes could lead to more accurate detection of human activities in videos. The paper reports state-of-the-art results on the UCF-Sports and AVA datasets and a significant margin over previous cuboid-based methods.

Core claim

The Deformable Tube Network consists of a Deformation Tube Proposal Network that uses a fast proposal linking algorithm to connect region proposals across frames into multiple deformable action tube proposals, and a Deformable Tube Recognition Network that employs a 3D convolution network with skip connections to perform tube classification and regression. Modelling action proposals as deformable tubes allows explicit consideration of action tube shapes compared to 3D cuboids, and the 3D convolution network learns temporal dynamics sufficiently for action detection.

What carries the argument

Deformable action tube proposals generated by linking region proposals across frames using the fast proposal linking algorithm in the Deformation Tube Proposal Network.
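The abstract does not spell out the linking criterion used by FTL, so the following is only a minimal sketch of the general idea of chaining per-frame boxes into a tube, assuming a greedy highest-IoU rule. The function names (`iou`, `link_proposals`), the `min_iou` threshold, and the toy boxes are all hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_proposals(frames: List[List[Box]], min_iou: float = 0.3) -> List[List[Box]]:
    """Greedily grow one tube per first-frame proposal (assumed rule, not FTL).

    At every later frame a tube takes the proposal with the highest IoU
    against its most recent box; a tube whose best match falls below
    min_iou is dropped.
    """
    if not frames:
        return []
    tubes = [[box] for box in frames[0]]
    for proposals in frames[1:]:
        extended = []
        for tube in tubes:
            best = max(proposals, key=lambda p: iou(tube[-1], p), default=None)
            if best is not None and iou(tube[-1], best) >= min_iou:
                extended.append(tube + [best])
        tubes = extended
    return tubes


# Toy input: one actor drifting right, one proposal that vanishes in frame 3.
frames = [
    [(0, 0, 10, 10), (50, 50, 60, 60)],
    [(2, 0, 12, 10), (50, 52, 60, 62)],
    [(4, 0, 14, 10)],
]
tubes = link_proposals(frames)
print(len(tubes), tubes)
```

In this toy input the second first-frame box has no overlapping match in the last frame, so only the drifting actor's tube survives. Each box in a tube keeps its own coordinates, which is what lets the tube follow the actor instead of being pinned to a single rectangle.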

If this is right

The method significantly outperforms approaches that use 3D cuboids for action detection.
It achieves state-of-the-art results on the UCF-Sports dataset.
It achieves state-of-the-art results on the AVA dataset.
Recognition based on 3D convolution learns temporal dynamics well enough to improve detection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If deformable tubes better capture varying shapes, similar linking methods could improve other video understanding tasks like object tracking.
The approach may allow detection of actions with complex motions that rigid cuboids miss.
Extending the fast linking algorithm to longer videos could test scalability.

Load-bearing premise

The fast proposal linking algorithm produces deformable tube proposals that accurately capture the varying shapes of actions across frames.

What would settle it

Running the detector on a new dataset with actions that change shape dramatically between frames and finding no improvement over 3D cuboid methods would challenge the claim.
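As a rough illustration of the comparison described above, the toy sketch below measures how much per-frame overlap a rigid cuboid loses against a tube that follows the actor. It assumes the cuboid is the single box enclosing every frame of the action, which is a simplification chosen here and not a definition from the paper; `enclosing_cuboid`, `mean_frame_iou`, and the box coordinates are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def enclosing_cuboid(tube: List[Box]) -> Box:
    """Smallest single box that covers every per-frame box of the tube."""
    return (
        min(b[0] for b in tube),
        min(b[1] for b in tube),
        max(b[2] for b in tube),
        max(b[3] for b in tube),
    )


def mean_frame_iou(pred: List[Box], gt: List[Box]) -> float:
    """Average per-frame IoU between a predicted and a ground-truth tube."""
    return sum(iou(p, g) for p, g in zip(pred, gt)) / len(gt)


# Hypothetical ground truth: an actor moving one box-width right per frame.
gt_tube = [(0, 0, 10, 10), (10, 0, 20, 10), (20, 0, 30, 10)]
cuboid = enclosing_cuboid(gt_tube)

print("cuboid:", mean_frame_iou([cuboid] * len(gt_tube), gt_tube))
print("deformable tube:", mean_frame_iou(gt_tube, gt_tube))
```

Here the cuboid is three times wider than the actor in every frame, so its per-frame IoU is one third, while a tube that tracks the actor perfectly scores 1.0. If real detections on fast-moving actions showed no such gap, that would be the kind of evidence against the claim that this section describes.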

Figures

Figures reproduced from arXiv: 1907.01847 by Changhu Wang, Dashan Guo, Lei Huang, Wei Li, Xiangzhong Fang, Zehuan Yuan.

**Figure 1.** The overall architecture of our proposed two-stage action localization model with DTPN and DTRN. We link per-frame proposals ...

**Figure 2.** Per-category AP on AVA dataset: baseline model, baseline-multi model and our DTN. Categories are sorted by the number ...

**Figure 3.** Visualization of five detection examples from UCF-Sports dataset. Blue boxes indicate model detections and red boxes denote ...

**Figure 4.** Visualization examples on AVA dataset. Blue boxes indicate model predictions and red boxes denote ground truths. The ground ...

**Figure 5.** Visualization of linked tube examples with our DTPN. The green boxes represent region proposals of linked tubes and the red ...

Original abstract

We address the problem of spatio-temporal action detection in videos. Existing methods commonly either ignore temporal context in action recognition and localization, or lack the modelling of flexible shapes of action tubes. In this paper, we propose a two-stage action detector called Deformable Tube Network (DTN), which is composed of a Deformation Tube Proposal Network (DTPN) and a Deformable Tube Recognition Network (DTRN) similar to the Faster R-CNN architecture. In DTPN, a fast proposal linking algorithm (FTL) is introduced to connect region proposals across frames to generate multiple deformable action tube proposals. To perform action detection, we design a 3D convolution network with skip connections for tube classification and regression. Modelling action proposals as deformable tubes explicitly considers the shape of action tubes compared to 3D cuboids. Moreover, 3D convolution based recognition network can learn temporal dynamics sufficiently for action detection. Our experimental results show that we significantly outperform the methods with 3D cuboids and obtain the state-of-the-art results on both UCF-Sports and AVA datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper swaps rigid 3D cuboids for deformable tubes linked across frames via a fast algorithm, then classifies them with a 3D convolution network with skip connections, claiming state-of-the-art results on UCF-Sports and AVA.

The punchline here is that the paper proposes using deformable action tubes instead of rigid 3D cuboids by linking region proposals with a fast algorithm in the first stage and then applying 3D convolutions with skip connections in the second stage for detection, with reported state-of-the-art performance on UCF-Sports and AVA. The work does a good job of identifying the gaps in previous approaches that either overlook temporal context or cannot model changing tube shapes. The two-stage structure is familiar from object detection but applied here to videos in a way that allows the tubes to deform. The fast proposal linking is presented as an efficient way to connect proposals temporally without heavy computation.

What is actually new is this specific combination for generating and recognizing deformable tubes. It builds on existing 3D conv techniques but focuses the innovation on the proposal shape flexibility.

The soft spots are mainly around the empirical support. The abstract makes strong claims about outperforming cuboid methods, but the details on exact metrics, baseline implementations, and ablations are not in the summary, so the full paper needs to show that the deformable aspect is what drives the improvement rather than other network choices. If the experiments are well-controlled, this would be fine, but it's a common area where papers can overclaim. The citation pattern appears typical for the field, referencing standard datasets and methods.

This paper is for specialists in video action detection who are interested in alternatives to cuboid representations. It is not a foundational shift but a targeted improvement that could be useful to try. It deserves peer review because the idea is clear and the claims are specific enough to be checked against the experiments.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes the Deformable Tube Network (DTN), a two-stage detector analogous to Faster R-CNN, consisting of a Deformation Tube Proposal Network (DTPN) that employs a fast proposal linking (FTL) algorithm to connect per-frame region proposals into deformable action tube proposals, followed by a Deformable Tube Recognition Network (DTRN) that applies 3D convolutions with skip connections for tube classification and regression. The central claim is that explicitly modeling flexible tube shapes (rather than fixed 3D cuboids) combined with sufficient temporal modeling yields significant outperformance over cuboid-based methods and state-of-the-art results on the UCF-Sports and AVA datasets.

Significance. If the empirical claims are substantiated, the work would advance spatio-temporal action detection by replacing rigid cuboid proposals with deformable tubes that better accommodate varying action shapes across frames. The combination of proposal linking with 3D-convolutional recognition is a natural extension of existing two-stage detectors and could improve localization accuracy on benchmarks where actions exhibit non-rigid motion.

major comments (1)

[Abstract] Abstract: the assertion that the method 'significantly outperform[s] the methods with 3D cuboids and obtain[s] the state-of-the-art results on both UCF-Sports and AVA datasets' supplies no quantitative metrics, baseline names, dataset splits, ablation results, or error bars. Because the paper's contribution is framed entirely as an empirical improvement, the absence of these supporting data is load-bearing for the central claim.

minor comments (1)

[Abstract] Abstract: the description of the FTL linking step is limited to a single sentence; a brief statement of its computational complexity or linking criterion would clarify how the deformable tubes are generated before the reader reaches the methods section.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. The single major comment concerns the abstract's lack of supporting quantitative details for the empirical claims. We address this point below and agree that a revision to the abstract is warranted.

Referee: [Abstract] Abstract: the assertion that the method 'significantly outperform[s] the methods with 3D cuboids and obtain[s] the state-of-the-art results on both UCF-Sports and AVA datasets' supplies no quantitative metrics, baseline names, dataset splits, ablation results, or error bars. Because the paper's contribution is framed entirely as an empirical improvement, the absence of these supporting data is load-bearing for the central claim.

Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revised version we will update the abstract to report the primary frame-mAP numbers on UCF-Sports and AVA, the main competing baselines, and the standard dataset splits used. Space constraints preclude embedding full ablation tables or per-run error bars in the abstract; those results are already presented with full detail (including standard deviations where computed) in Section 4. This targeted revision will make the central empirical claim self-contained while preserving the abstract's readability.

Revision: partial
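The rebuttal refers to frame-mAP without defining it. As a hedged illustration of the per-class quantity behind such a number, the sketch below computes a non-interpolated average precision from scored detections and averages it over classes. The exact protocol (IoU threshold, interpolation, matching rules) differs between benchmarks and is not stated here, so the function names and toy inputs are assumptions rather than the paper's evaluation code.

```python
from typing import Dict, List, Tuple

Detection = Tuple[float, bool]  # (confidence score, matched a ground-truth box?)


def average_precision(detections: List[Detection], num_gt: int) -> float:
    """Non-interpolated AP: mean of the precision at each true-positive rank."""
    if num_gt == 0:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, is_hit) in enumerate(ranked, start=1):
        if is_hit:
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / num_gt


def mean_average_precision(per_class: Dict[str, Tuple[List[Detection], int]]) -> float:
    """Unweighted mean of the per-class AP values."""
    aps = [average_precision(dets, n) for dets, n in per_class.values()]
    return sum(aps) / len(aps) if aps else 0.0


# Hypothetical detections for two made-up classes.
per_class = {
    "run": ([(0.9, True), (0.8, False), (0.7, True)], 2),
    "dive": ([(0.6, True)], 1),
}
print(mean_average_precision(per_class))
```

Reporting the per-class values alongside the mean, as Figure 2 does for AVA, makes it easier to see whether a gain comes from a few categories or is spread across all of them.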

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on experiments

The paper describes a two-stage neural architecture (DTPN with FTL linking to produce deformable tubes, followed by 3D-conv DTRN) whose central claims are empirical outperformance on UCF-Sports and AVA. No equations, first-principles derivations, or predictions appear that reduce to inputs by construction. Performance assertions are supported by reported results rather than self-referential fitting or self-citation chains. The work is self-contained as standard empirical CV research with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; 'deformable tubes' is presented as a modeling choice rather than a new postulated physical entity.

pith-pipeline@v0.9.0 · 5734 in / 1099 out tokens · 59964 ms · 2026-05-25T10:38:03.098199+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Paper passage: "fast proposal linking algorithm (FTL) ... deformable action tube proposals ... 3D convolution network with skip connections"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · tag: unclear
Paper passage: "Modelling action proposals as deformable tubes explicitly considers the shape of action tubes compared to 3D cuboids"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 1 internal anchor


[1] [1]

L. Cao, Z. Liu, and T. S. Huang. Cross-dataset action de- tection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

work page

[2] [2]

Mxnet: A ﬂexible and efﬁcient machine learn- ing library for heterogeneous distributed systems

Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. Mxnet: A ﬂexible and efﬁcient machine learn- ing library for heterogeneous distributed systems. 2015

work page 2015

[3] [3]

Actor-centric re- lation network

Carl V ondrick Kevin Murphy Rahul Sukthankar Chen Sun, Abhinav Shrivastava and Cordelia Schmid. Actor-centric re- lation network. In European Conference on Computer Vision (ECCV), 2018

work page 2018

[4] [4]

Long-term recurrent convolutional net- works for visual recognition and description

Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional net- works for visual recognition and description. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. 4332

work page 2015

[5] [5]

VideoCapsuleNet: A Simplified Network for Action Detection

Kevin Duarte, Yogesh Singh Rawat, and Mubarak Shah. Videocapsulenet: A simpliﬁed network for action detection. arXiv preprint arXiv:1805.08162, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[6] [6]

Slowfast networks for video recognition

Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. arXiv preprint arXiv:1812.03982, 2018

work page arXiv 2018

[7] [7]

P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ra- manan. Object detection with discriminatively trained part- based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010

work page 2010

[8] [8]

Bottom-up segmentation for top-down detection

Sanja Fidler, Roozbeh Mottaghi, Alan Yuille, and Raquel Urtasun. Bottom-up segmentation for top-down detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2013

work page 2013

[9] [9]

G. D. Forney. The viterbi algorithm. Proceedings of the IEEE, 61(3):268–278, 1973

work page 1973

[10] [10]

K. Fu, Q. Zhao, and I. Y . Gu. Reﬁnet: A deep segmen- tation assisted reﬁnement network for salient object detec- tion. IEEE Transactions on Multimedia, 21(2):457–469, Feb 2019

work page 2019

[11] [11]

Video action transformer network

Rohit Girdhar, Jo ˜ao Carreira, Carl Doersch, and Andrew Zis- serman. Video action transformer network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019

work page 2019

[12] [12]

Rich feature hierarchies for accurate object detec- tion and semantic segmentation

Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detec- tion and semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014

work page 2014

[13] [13]

Finding action tubes

Georgia Gkioxari and Jitendra Malik. Finding action tubes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015

work page 2015

[14] [14]

Ross, Carl V on- drick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijaya- narasimhan, George Toderici, Susanna Ricco, Rahul Suk- thankar, Cordelia Schmid, and Jitendra Malik

Chunhui Gu, Chen Sun, David A. Ross, Carl V on- drick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijaya- narasimhan, George Toderici, Susanna Ricco, Rahul Suk- thankar, Cordelia Schmid, and Jitendra Malik. Ava: A video dataset of spatio-temporally localized atomic visual actions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018

work page 2018

[15] [15]

Ibrahim, Zhiwei Deng, and Greg Mori

Jiawei He, Mostafa S. Ibrahim, Zhiwei Deng, and Greg Mori. Generic tubelet proposals for action localization. The IEEE Winter Conference on Applications of Computer Vision (WACV), 2018

work page 2018

[16] [16]

Mask r-cnn

Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Gir- shick. Mask r-cnn. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017

work page 2017

[17] [17]

Tube convolu- tional neural network (t-cnn) for action detection in videos

Rui Hou, Chen Chen, and Mubarak Shah. Tube convolu- tional neural network (t-cnn) for action detection in videos. In The IEEE International Conference on Computer Vision (ICCV), 2017

work page 2017

[18] [18]

Action tubelet detector for spatio- temporal action localization

Vicky Kalogeiton, Philippe Weinzaepfel, Vittorio Ferrari, and Cordelia Schmid. Action tubelet detector for spatio- temporal action localization. InThe IEEE International Con- ference on Computer Vision (ICCV), 2017

work page 2017

[19] [19]

Action Tubelet Detector for Spatio- Temporal Action Localization

Vicky Kalogeiton, Philippe Weinzaepfel, Vittorio Ferrari, and Cordelia Schmid. Action Tubelet Detector for Spatio- Temporal Action Localization. In The IEEE International Conference on Computer Vision (ICCV), 2017

work page 2017

[20] [21]

Tian Lan, Yang Wang, and G. Mori. Discriminative ﬁgure- centric models for joint action localization and recogni- tion. In The International Conference on Computer Vision (ICCV), 2011

work page 2011

[21] [22]

I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008

[22] [23]

I. Laptev and P. Perez. Retrieving actions in movies. In The IEEE International Conference on Computer Vision (ICCV), 2007

[23] [24]

Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989

[24] [25]

Dong Li, Zhaofan Qiu, Qi Dai, Ting Yao, and Tao Mei. Recurrent tubelet proposal and recognition networks for action detection. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss, editors, European Conference on Computer Vision (ECCV), 2018

[25] [26]

J. Li, X. Liang, J. Li, Y. Wei, T. Xu, J. Feng, and S. Yan. Multistage object detection with group recursive learning. IEEE Transactions on Multimedia, 20(7):1645–1655, July 2018

[26] [27]

Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, and Jian Sun. Detnet: Design backbone for object detection. In The European Conference on Computer Vision (ECCV), 2018

[27] [28]

Tsung-Yi Lin, Piotr Dollar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017

[28] [29]

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollar. Focal loss for dense object detection. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017

[29] [30]

Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. Ssd: Single shot multibox detector. In European Conference on Computer Vision (ECCV), 2016

[30] [31]

Xiaojiang Peng and Cordelia Schmid. Multi-region two-stream r-cnn for action detection. In European Conference on Computer Vision (ECCV), 2016

[31] [32]

Zhaofan Qiu, Ting Yao, and Tao Mei. Learning spatio-temporal representation with pseudo-3d residual networks. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017

[32] [33]

Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016

[33] [34]

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), 2015

[34] [35]

M. D. Rodriguez, J. Ahmed, and M. Shah. Action MACH: a spatio-temporal maximum average correlation height filter for action recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008

[35] [36]

Marcus Rohrbach, Anna Rohrbach, Michaela Regneri, Sikandar Amin, Mykhaylo Andriluka, Manfred Pinkal, and Bernt Schiele. Recognizing fine-grained and composite activities using hand-centric features and script data. International Journal of Computer Vision (IJCV), 119(3):346–373, Sep 2016

[36] [37]

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Nassir Navab, Joachim Hornegger, William M. Wells, and Alejandro F. Frangi, editors, Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015

[37] [38]

Suman Saha, Gurkirt Singh, Michael Sapienza, Philip H. S. Torr, and Fabio Cuzzolin. Deep learning for detecting multiple space-time action tubes in videos. 2016

[38] [39]

K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems (NIPS), 2014

[39] [40]

Gurkirt Singh, Suman Saha, Michael Sapienza, Philip Torr, and Fabio Cuzzolin. Online real time multiple spatiotemporal action localisation and prediction. 2017

[40] [41]

Khurram Soomro and Amir R. Zamir. Action Recognition in Realistic Sports Videos, pages 181–208. Springer International Publishing, Cham, 2014

[41] [42]

M. A. Tahir, F. Yan, P. Koniusz, M. Awais, M. Barnard, K. Mikolajczyk, A. Bouridane, and J. Kittler. A robust and scalable visual category and action recognition system using kernel discriminant analysis with spectral regression. IEEE Transactions on Multimedia, 15(7):1653–1664, Nov 2013

[42] [43]

Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018

[43] [44]

J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders. Selective search for object recognition. International Journal of Computer Vision (IJCV), 104(2):154–171, Sep 2013

[44] [45]

Heng Wang and Cordelia Schmid. Action recognition with improved trajectories. In The IEEE International Conference on Computer Vision (ICCV), 2013

[45] [46]

Xiaoyu Wang, Ming Yang, Shenghuo Zhu, and Yuanqing Lin. Regionlets for generic object detection. In The IEEE International Conference on Computer Vision (ICCV), December 2013

[46] [47]

Philippe Weinzaepfel, Zaid Harchaoui, and Cordelia Schmid. Learning to track for spatio-temporal action localization. In The IEEE International Conference on Computer Vision (ICCV), 2015

[47] [48]

Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krähenbühl, and Ross Girshick. Long-Term Feature Banks for Detailed Video Understanding. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019

[48] [49]

S. Zhang, C. Gao, J. Zhang, F. Chen, and N. Sang. Discriminative part selection for human action recognition. IEEE Transactions on Multimedia, 20(4):769–780, April 2018

[49] [50]

X. Zhen, F. Zheng, L. Shao, X. Cao, and D. Xu. Supervised local descriptor learning for human action recognition. IEEE Transactions on Multimedia, 19(9):2056–2065, Sep. 2017