Loss Switching Fusion with Similarity Search for Video Classification

Du Q. Huynh; Lei Wang; Moussa Reda Mansour

arxiv: 1906.11465 · v1 · pith:BSSXSCLTnew · submitted 2019-06-27 · 💻 cs.CV · cs.LG

Pith reviewed 2026-05-25 15:03 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords: video classification, loss switching fusion, similarity search, spatiotemporal descriptors, background motion, foreground motion, scene understanding, soft voting

The pith

A Loss Switching Fusion Network fuses spatiotemporal descriptors and adds similarity search with soft voting so one feature set can classify both background motions and human foreground motions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines video classification for outdoor scene understanding as the joint task of labeling background motion types and detecting human motions from the same representation. It introduces LSFNet to fuse the descriptors by switching losses during training and pairs the result with a similarity search plus soft voting step. The approach is evaluated on two private industry datasets. If the method works, video systems could handle multiple motion tasks without building separate feature pipelines for each.

Core claim

The central claim is that the proposed Loss Switching Fusion Network fuses spatiotemporal descriptors via a loss-switching mechanism and, combined with similarity search and soft voting, yields a system that remains robust when classifying different background motions and when detecting human motions from those backgrounds, all using the identical feature representation.
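
As a rough illustration of the loss-switching idea, the sketch below trains a single linear classifier on two concatenated descriptors and alternates between two losses by epoch. The abstract does not say which losses LSFNet switches between, when it switches, or how the descriptors are fused, so the logistic and squared-error losses, the even/odd schedule, fusion by concatenation, and the toy data are all assumptions made for illustration, not the authors' method.

```python
import math
import random


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def score(w, b, x):
    # Linear score on the fused feature vector.
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def grad_logistic(s, y):
    # Derivative w.r.t. the score of binary cross-entropy, label y in {0, 1}.
    return sigmoid(s) - y


def grad_squared(s, y):
    # Derivative w.r.t. the score of 0.5 * (s - t)^2, target t in {-1, +1}.
    t = 2.0 * y - 1.0
    return s - t


def train(samples, epochs=200, lr=0.1, seed=0):
    rng = random.Random(seed)
    dim = len(samples[0][0]) + len(samples[0][1])
    w, b = [0.0] * dim, 0.0
    for epoch in range(epochs):
        # Assumed switching rule: even epochs use the logistic loss,
        # odd epochs use the squared-error loss.
        grad_fn = grad_logistic if epoch % 2 == 0 else grad_squared
        order = samples[:]
        rng.shuffle(order)
        for desc_a, desc_b, y in order:
            x = desc_a + desc_b  # fusion by concatenation (assumption)
            g = grad_fn(score(w, b, x), y)
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


# Toy data: (descriptor A, descriptor B, label). Values are made up.
samples = [
    ([1.0, 0.2], [0.9], 1),
    ([0.8, 0.1], [1.0], 1),
    ([0.1, 0.9], [0.0], 0),
    ([0.2, 1.0], [0.1], 0),
]
w, b = train(samples)
predictions = [int(score(w, b, a + d) > 0.0) for a, d, _ in samples]
```

Both losses update the same weights, so the switching here only changes the gradient signal per epoch; a real network would presumably switch losses over a much richer shared representation.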

What carries the argument

Loss Switching Fusion Network (LSFNet) that alternates loss functions to fuse spatiotemporal descriptors, together with a similarity search scheme that applies soft voting for final classification.
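
One plausible reading of "similarity search with soft voting" is a nearest-neighbour lookup in which each retrieved neighbour votes for its label with a weight equal to its similarity. The sketch below assumes cosine similarity, a fixed `k`, and hypothetical class names such as `tree_sway`; none of these details come from the paper.

```python
import math
from collections import defaultdict


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def soft_vote(query, gallery, k=3):
    """Classify `query` from its k most similar gallery items.

    `gallery` is a list of (feature_vector, label) pairs.
    Returns (best_label, normalised vote weights per label).
    """
    scored = sorted(
        ((cosine_similarity(query, feat), label) for feat, label in gallery),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    votes = defaultdict(float)
    for sim, label in scored:
        # Soft vote: weight by similarity, ignoring negative similarities.
        votes[label] += max(sim, 0.0)
    total = sum(votes.values())
    if total == 0.0:
        return None, {}
    probs = {label: weight / total for label, weight in votes.items()}
    return max(probs, key=probs.get), probs


# Hypothetical gallery of fused descriptors with made-up labels.
gallery = [
    ([1.0, 0.0], "tree_sway"),
    ([0.9, 0.1], "tree_sway"),
    ([0.0, 1.0], "water_ripple"),
    ([0.1, 0.9], "water_ripple"),
]
label, probs = soft_vote([0.8, 0.2], gallery, k=3)
```

Compared with a hard majority vote, the weighted vote lets one very close neighbour outweigh several distant ones, which may be the motivation for the soft variant.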

If this is right

The same pipeline supports content-based video clustering.
It enables filtering of large video collections by motion type.
Background motion categories can be distinguished reliably.
Human motions can be isolated from surrounding background motions.
The system fits surveillance and streaming applications that need scene understanding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the loss-switching idea generalizes, similar switching could be tried for other descriptor fusion problems in video.
The lightweight design suggests the method might run on edge devices for real-time filtering.
Extending the similarity search step to temporal sequences longer than the training clips could be tested directly.

Load-bearing premise

The shared feature representation must stay robust enough to support both background-motion classification and human-motion detection without needing separate adaptations for each task.

What would settle it

A head-to-head test on a held-out video collection in which the LSFNet-plus-similarity-search pipeline shows no accuracy gain over ordinary descriptor fusion would falsify the robustness claim.
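
A head-to-head comparison of this kind could be scored as in the sketch below, which computes accuracy for two pipelines on the same held-out clips and applies an exact McNemar test to the clips where they disagree. The choice of test, the variable names, and the prediction lists are illustrative assumptions; the numbers are made up and are not results from the paper.

```python
from math import comb


def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def exact_mcnemar(preds_a, preds_b, labels):
    """Exact two-sided McNemar test on paired predictions.

    Returns (clips only A got right, clips only B got right, p-value).
    """
    only_a = sum(a == y and b != y for a, b, y in zip(preds_a, preds_b, labels))
    only_b = sum(b == y and a != y for a, b, y in zip(preds_a, preds_b, labels))
    n = only_a + only_b
    if n == 0:
        return only_a, only_b, 1.0
    # Under the null hypothesis, disagreements split 50/50 between A and B.
    tail = sum(comb(n, i) for i in range(min(only_a, only_b) + 1)) / 2 ** n
    return only_a, only_b, min(1.0, 2.0 * tail)


# Made-up held-out labels and predictions, for illustration only.
labels = [1, 1, 0, 0, 1, 0, 1, 0]
fused_preds = [1, 1, 0, 0, 1, 0, 1, 1]     # hypothetical LSFNet + search
baseline_preds = [1, 0, 0, 1, 1, 0, 0, 1]  # hypothetical plain fusion

gain = accuracy(fused_preds, labels) - accuracy(baseline_preds, labels)
result = exact_mcnemar(fused_preds, baseline_preds, labels)
```

With only eight clips even a large accuracy gap is not statistically significant, so a real comparison would need a much larger held-out set.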

Original abstract

From video streaming to security and surveillance applications, video data play an important role in our daily living today. However, managing a large amount of video data and retrieving the most useful information for the user remain a challenging task. In this paper, we propose a novel video classification system that would benefit the scene understanding task. We define our classification problem as classifying background and foreground motions using the same feature representation for outdoor scenes. This means that the feature representation needs to be robust enough and adaptable to different classification tasks. We propose a lightweight Loss Switching Fusion Network (LSFNet) for the fusion of spatiotemporal descriptors and a similarity search scheme with soft voting to boost the classification performance. The proposed system has a variety of potential applications such as content-based video clustering, video filtering, etc. Evaluation results on two private industry datasets show that our system is robust in both classifying different background motions and detecting human motions from these background motions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes a Loss Switching Fusion Network (LSFNet) to fuse spatiotemporal descriptors for video classification, combined with a similarity search scheme using soft voting. The central task is to classify background and foreground (human) motions in outdoor scenes using a single shared feature representation that must be robust and adaptable across tasks. The system is claimed to be lightweight with potential applications in video clustering and filtering. Robustness is asserted based on evaluation results from two private industry datasets.

Significance. If the robustness claims were verifiable, the method could contribute to scene understanding tasks in surveillance and streaming by enabling a shared representation for multiple motion classification problems. However, the absence of any quantitative metrics, baselines, error bars, or public replication details means the result, even if internally consistent, offers no reproducible advance or falsifiable prediction for the community.

major comments (1)

Abstract (evaluation results paragraph): the claim that the system 'is robust in both classifying different background motions and detecting human motions' rests entirely on two private industry datasets, yet supplies no performance numbers, baselines, statistical details, or method hyperparameters. This directly prevents any assessment of whether the LSFNet fusion or similarity search delivers the required adaptability stated as a prerequisite in the abstract.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for the review and the opportunity to respond. We address the major comment below.

Point-by-point responses

Referee: Abstract (evaluation results paragraph): the claim that the system 'is robust in both classifying different background motions and detecting human motions' rests entirely on two private industry datasets, yet supplies no performance numbers, baselines, statistical details, or method hyperparameters. This directly prevents any assessment of whether the LSFNet fusion or similarity search delivers the required adaptability stated as a prerequisite in the abstract.

Authors: We acknowledge that the abstract provides no numerical performance values, baselines, error bars, or hyperparameters, which limits independent verification of the robustness and adaptability claims. The manuscript centers on the LSFNet architecture for fusing spatiotemporal descriptors via loss switching and the similarity search with soft voting to support a shared representation across background and foreground motion tasks. Because the evaluation datasets are private industry collections, specific metrics and replication details cannot be released. The contribution is therefore presented primarily through the method description rather than through publicly verifiable quantitative results.

Revision: no

standing simulated objections not resolved

Private industry datasets prevent disclosure of performance numbers, baselines, statistical details, hyperparameters, or replication materials required for external assessment and reproducibility.

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper proposes LSFNet as a lightweight fusion network for spatiotemporal descriptors combined with a similarity search and soft voting scheme. No equations, derivations, or first-principles predictions appear in the provided abstract or description. The central claims rest on empirical evaluation rather than any mathematical reduction that equates outputs to inputs by construction. No self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations are present. The method is described as novel without invoking uniqueness theorems or ansatzes from prior author work. This is a standard empirical proposal whose performance claims stand or fall on the reported experiments, with no internal circularity in any derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations, training details, or modeling choices, so no free parameters, axioms, or invented entities can be identified.
