Gesture Recognition in RGB Videos Using Human Body Keypoints and Dynamic Time Warping

Dietrich Paulus; Ivanna Kramer; Pascal Schneider; Raphael Memmesheimer

arxiv: 1906.12171 · v1 · pith:NRBMTVX7new · submitted 2019-06-25 · 💻 cs.CV · cs.LG · cs.RO

Pascal Schneider, Raphael Memmesheimer, Ivanna Kramer, Dietrich Paulus

Pith reviewed 2026-05-25 16:35 UTC · model grok-4.3

classification 💻 cs.CV · cs.LG · cs.RO

keywords gesture recognition · OpenPose · Dynamic Time Warping · RGB video · pose estimation · time series classification · human-robot interaction

The pith

Gesture recognition from RGB video works by tracking body keypoints with OpenPose, then aligning the resulting sequences via Dynamic Time Warping and classifying them by nearest-neighbor lookup.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that human gestures can be recognized from ordinary RGB video by first extracting 2D body keypoints and then comparing their time series with Dynamic Time Warping plus one-nearest-neighbor classification. A sympathetic reader would care because the method requires no specialized depth cameras and lets a user add a new gesture simply by recording a few example videos. It re-uses an existing deep-learning pose estimator instead of training a new network on gesture data. The approach is tested on a public dataset to measure how well the resulting similarity scores separate different gestures. If the claim holds, service robots could gain flexible, hardware-light gesture interfaces without large labeled collections.

Core claim

The central claim is that combining OpenPose keypoint trajectories with Dynamic Time Warping and 1NN produces reliable gesture classification on RGB video, while remaining independent of any particular capture hardware and allowing new gestures to be added by supplying only a few examples.

What carries the argument

OpenPose keypoint extraction followed by DTW+1NN alignment and comparison of the resulting 2D pose time series.
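
A minimal sketch of the DTW step, assuming each video has already been reduced to a list of per-frame pose vectors (flattened x, y keypoint coordinates). The names `frame_distance` and `dtw_distance`, the Euclidean frame cost, and the unconstrained warping path are illustrative assumptions, not details confirmed by the paper.

```python
import math
from typing import Sequence

# Assumption: one frame = flattened (x, y) coordinates of all keypoints.
Frame = Sequence[float]


def frame_distance(a: Frame, b: Frame) -> float:
    """Euclidean distance between two pose frames of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def dtw_distance(s: Sequence[Frame], t: Sequence[Frame]) -> float:
    """Accumulated cost of the cheapest monotone alignment of s and t."""
    n, m = len(s), len(t)
    inf = float("inf")
    # cost[i][j]: minimal accumulated cost of aligning s[:i] with t[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(s[i - 1], t[j - 1])
            # Extend the cheapest of: repeat a frame of t, repeat a frame of s, advance both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Toy one-dimensional "poses": the same motion at two speeds, and its reverse.
slow = [[0.0], [0.0], [1.0], [1.0], [2.0]]
fast = [[0.0], [1.0], [2.0]]
reverse = [[2.0], [1.0], [0.0]]
print(dtw_distance(slow, fast), dtw_distance(fast, reverse))
```

Because the warping path may repeat frames, the slow and fast versions of the toy motion align at zero cost while the reversed motion does not, which is the property the pipeline relies on when gestures are performed at different speeds.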

If this is right

Recognition runs on any RGB camera without depth sensors or custom rigs.
A new gesture enters the system by adding a handful of example videos to the reference set.
Classification operates on the temporal shape of pose trajectories rather than learned visual features.
The method avoids the data and compute cost of training an end-to-end gesture network.
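
The few-example property in the list above can be sketched as a reference set that grows by appending labelled sequences, with no training step. This is a hedged illustration only: `GestureLibrary`, `add_example`, and `classify` are hypothetical names, and the compact `dtw` helper with a Euclidean frame cost is an assumed stand-in for whatever DTW variant the paper uses.

```python
import math
from typing import Callable, List, Sequence, Tuple

PoseSequence = Sequence[Sequence[float]]  # one pose vector per frame


def dtw(s: PoseSequence, t: PoseSequence) -> float:
    """Compact DTW with Euclidean frame cost, keeping only two rows."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(t)
    for a in s:
        curr = [inf] * (len(t) + 1)
        for j, b in enumerate(t, start=1):
            curr[j] = math.dist(a, b) + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[len(t)]


class GestureLibrary:
    """1NN classifier over a growing set of labelled example sequences."""

    def __init__(self, distance: Callable[[PoseSequence, PoseSequence], float] = dtw):
        self._examples: List[Tuple[str, PoseSequence]] = []
        self._distance = distance

    def add_example(self, label: str, sequence: PoseSequence) -> None:
        # Adding a gesture is just storing another reference sequence.
        self._examples.append((label, sequence))

    def classify(self, query: PoseSequence) -> str:
        if not self._examples:
            raise ValueError("no reference examples stored")
        label, _ = min(self._examples, key=lambda ex: self._distance(query, ex[1]))
        return label


library = GestureLibrary()
library.add_example("raise", [[0.0], [1.0], [2.0]])
library.add_example("lower", [[2.0], [1.0], [0.0]])
first = library.classify([[0.0], [0.0], [1.0], [2.0], [2.0]])

# A new class becomes available as soon as one example is stored.
library.add_example("hold", [[1.0], [1.0], [1.0]])
second = library.classify([[1.0], [1.0], [1.0], [1.0]])
print(first, second)
```

The trade-off of this design is that classification cost grows with the number of stored examples, since every query is compared against the whole reference set.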

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pipeline could be tested on other temporal tasks such as action segmentation where pose is the dominant cue.
Replacing DTW with a learned distance metric might improve accuracy while keeping the few-example property.
Deployment on a mobile robot would let gestures serve as an attention or command channel without retraining the vision stack.

Load-bearing premise

The 2D keypoints produced by OpenPose stay accurate and consistent enough across the target videos and conditions for DTW to yield reliable similarity scores.
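
One way to probe this premise is to summarize how often each joint is missing and how much it moves between consecutive frames. The sketch below assumes each frame is a list of `(x, y, confidence)` triples per joint and that a low confidence marks a missed detection; the function names and the `min_conf` threshold are hypothetical. Note that frame-to-frame displacement mixes real motion with estimator noise, so it is only a rough proxy for jitter.

```python
import math
from typing import List, Sequence, Tuple

Keypoint = Tuple[float, float, float]  # (x, y, confidence), assumed layout


def dropout_rate(frames: Sequence[Sequence[Keypoint]], min_conf: float = 0.1) -> List[float]:
    """Fraction of frames in which each joint falls below min_conf."""
    n_joints = len(frames[0])
    missing = [0] * n_joints
    for frame in frames:
        for k, (_, _, conf) in enumerate(frame):
            if conf < min_conf:
                missing[k] += 1
    return [count / len(frames) for count in missing]


def mean_displacement(frames: Sequence[Sequence[Keypoint]], min_conf: float = 0.1) -> List[float]:
    """Mean frame-to-frame displacement per joint, over frame pairs where
    the joint is detected in both frames; NaN if there is no such pair."""
    n_joints = len(frames[0])
    totals = [0.0] * n_joints
    counts = [0] * n_joints
    for prev, curr in zip(frames, frames[1:]):
        for k in range(n_joints):
            x0, y0, c0 = prev[k]
            x1, y1, c1 = curr[k]
            if c0 >= min_conf and c1 >= min_conf:
                totals[k] += math.hypot(x1 - x0, y1 - y0)
                counts[k] += 1
    return [t / n if n else float("nan") for t, n in zip(totals, counts)]


# Toy clip: 3 frames, 2 joints; joint 1 is missed in the middle frame.
clip = [
    [(0.0, 0.0, 0.9), (1.0, 1.0, 0.8)],
    [(3.0, 4.0, 0.9), (0.0, 0.0, 0.0)],
    [(3.0, 4.0, 0.9), (1.0, 1.0, 0.7)],
]
rates = dropout_rate(clip)
moves = mean_displacement(clip)
print(rates, moves)
```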

What would settle it

Run the pipeline on a set of videos that vary in lighting, camera angle, or partial occlusion; if DTW distances no longer separate the gesture classes at rates comparable to the reported results, the method fails.

Figures

Figures reproduced from arXiv: 1906.12171 by Dietrich Paulus, Ivanna Kramer, Pascal Schneider, Raphael Memmesheimer.

**Figure 2.** Overview of the processing pipeline of our method. (Grey rectangles rep…

**Figure 3.** Normalized key point coordinates for a sequence of 44 images from a…

**Figure 4.** Confusion matrix for the classification of the actions given in Table 1.

Original abstract

Gesture recognition opens up new ways for humans to intuitively interact with machines. Especially for service robots, gestures can be a valuable addition to the means of communication to, for example, draw the robot's attention to someone or something. Extracting a gesture from video data and classifying it is a challenging task and a variety of approaches have been proposed throughout the years. This paper presents a method for gesture recognition in RGB videos using OpenPose to extract the pose of a person and Dynamic Time Warping (DTW) in conjunction with One-Nearest-Neighbor (1NN) for time-series classification. The main features of this approach are the independence of any specific hardware and high flexibility, because new gestures can be added to the classifier by adding only a few examples of it. We utilize the robustness of the Deep Learning-based OpenPose framework while avoiding the data-intensive task of training a neural network ourselves. We demonstrate the classification performance of our method using a public dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Straightforward OpenPose + DTW pipeline for RGB gesture recognition that avoids training a new network, but the keypoint consistency assumption is unexamined.

The paper combines OpenPose for 2D keypoints from RGB video with DTW and 1NN to classify gestures. New gestures are added by supplying a few examples rather than retraining. The authors evaluate on a public dataset and highlight hardware independence as the main practical upside. That is the core contribution: a lightweight combination of two off-the-shelf pieces that sidesteps the usual data hunger of end-to-end networks. The flexibility claim follows directly from using a distance-based classifier instead of a learned one. The approach is honest about what it re-uses and does not overclaim novelty in the method itself.

The main weakness is exactly the one flagged as the load-bearing premise. Nothing in the abstract or the described pipeline quantifies how stable the OpenPose keypoints are on the target videos. There are no error rates on joint detection, no ablation on missing joints, and no tests under the lighting or viewpoint changes that would matter for real RGB deployment. If the keypoints are noisy, DTW distances become unreliable, and both the accuracy and the few-example addition claim fall apart. The paper treats OpenPose robustness as given rather than measured.

This is the sort of work that could serve as a simple baseline for service-robot gesture interfaces. A reader who needs a non-neural-network option and is willing to do their own validation of keypoint quality would get something usable from it. I would bring it to a reading group as a maybe, mainly to look at the actual numbers and dataset details. I would not cite it in my own work. It is solid enough on its own terms to go to peer review rather than desk reject, provided the experiments include at least basic checks on the keypoints.

Referee Report

2 major / 2 minor

Summary. The paper proposes a gesture recognition method for RGB videos that extracts 2D human body keypoints via the pre-trained OpenPose framework and performs classification using Dynamic Time Warping (DTW) combined with 1-Nearest Neighbor (1NN). It highlights hardware independence (no custom sensors) and flexibility (new gestures added via a few example videos only), while avoiding training a neural network from scratch. Performance is demonstrated on a public dataset.

Significance. If the results hold under the stated conditions, the approach offers a lightweight, adaptable alternative to end-to-end learned models for gesture recognition in robotics and HCI. Credit is due for the explicit design choice to reuse off-the-shelf pose estimation and a parameter-light classifier, which directly supports the claimed ease of extending the gesture vocabulary without retraining.

major comments (2)

[Method description and Experiments] The central claim that OpenPose-derived keypoints form sufficiently clean and consistent time series for reliable DTW+1NN separation (even with few-shot addition of new classes) is load-bearing, yet the manuscript supplies no quantitative characterization of keypoint jitter, dropout rates, or occlusion effects on the chosen public dataset. This directly affects both reported accuracy and the hardware-independence claim.
[Experiments] No ablation is reported on the impact of missing or noisy joints (common in RGB video) on DTW distance computation or 1NN accuracy. Without this, it is impossible to assess whether the claimed flexibility survives realistic video conditions.

minor comments (2)

[Method] Clarify the exact DTW variant (e.g., Sakoe-Chiba band width, distance metric on keypoints) and any preprocessing of the 2D keypoint trajectories.
[Experiments] The public dataset should be named explicitly with citation and split details (train/test, number of gestures, subjects).
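
To make the first minor comment concrete, the following is a sketch of DTW restricted to a Sakoe-Chiba band, where the alignment may pair frame i only with frames j satisfying |i - j| <= window. The function name `dtw_banded`, the Euclidean frame cost, and the choice to widen the band to the length difference are assumptions for illustration; the paper's actual variant is exactly what the referee asks the authors to specify.

```python
import math
from typing import Sequence

PoseSequence = Sequence[Sequence[float]]  # one pose vector per frame


def dtw_banded(s: PoseSequence, t: PoseSequence, window: int) -> float:
    """DTW cost with a Sakoe-Chiba band of half-width `window` (in frames)."""
    n, m = len(s), len(t)
    # The band must be at least |n - m| wide, or the end cell is unreachable.
    w = max(window, abs(n - m))
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        # Only cells inside the band are filled; the rest stay infinite.
        for j in range(max(1, i - w), min(m, i + w) + 1):
            d = math.dist(s[i - 1], t[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Same step-shaped motion, with the step occurring at different times.
late_step = [[0.0], [0.0], [0.0], [1.0]]
early_step = [[0.0], [1.0], [1.0], [1.0]]
for width in (0, 1, 3):
    print(width, dtw_banded(late_step, early_step, width))
```

A band of width 0 reduces to a frame-by-frame comparison, while wider bands tolerate larger timing differences, so the reported accuracy can depend noticeably on this one parameter.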

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and agree that additional analyses will strengthen the paper. We will incorporate the suggested characterizations and ablations in the revised version.

Referee: [Method description and Experiments] The central claim that OpenPose-derived keypoints form sufficiently clean and consistent time series for reliable DTW+1NN separation (even with few-shot addition of new classes) is load-bearing, yet the manuscript supplies no quantitative characterization of keypoint jitter, dropout rates, or occlusion effects on the chosen public dataset. This directly affects both reported accuracy and the hardware-independence claim.

Authors: We agree that explicit quantitative characterization of keypoint quality would better support the central claims. The current manuscript reports end-to-end classification results on the public dataset but does not include separate metrics for jitter, dropout, or occlusion. In the revision we will add a dedicated subsection with statistics on keypoint confidence scores, missing joint rates, and qualitative examples of occlusion handling from the dataset videos. This addition will directly address the hardware-independence claim. revision: yes
Referee: [Experiments] No ablation is reported on the impact of missing or noisy joints (common in RGB video) on DTW distance computation or 1NN accuracy. Without this, it is impossible to assess whether the claimed flexibility survives realistic video conditions.

Authors: We concur that an ablation on the effects of missing or noisy joints is necessary to evaluate robustness. The manuscript does not contain such an experiment. In the revised manuscript we will add an ablation study that systematically removes or perturbs joints in the keypoint sequences and reports the resulting change in DTW+1NN accuracy. This will clarify the limits of the few-shot flexibility under realistic RGB conditions. revision: yes
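
The promised ablation could take roughly the following shape. This is a sketch under stated assumptions, not the authors' protocol: `perturb` and `accuracy` are hypothetical helpers, keypoints are assumed to be `(x, y, confidence)` triples, and a dropped joint is encoded as all zeros.

```python
import random
from typing import Callable, List, Sequence, Tuple

Keypoint = Tuple[float, float, float]  # (x, y, confidence), assumed layout
PoseSequence = List[List[Keypoint]]    # frames, then joints


def perturb(seq: PoseSequence, drop_prob: float, noise_std: float,
            rng: random.Random) -> PoseSequence:
    """Return a copy of seq with joints randomly dropped or jittered."""
    out: PoseSequence = []
    for frame in seq:
        new_frame: List[Keypoint] = []
        for x, y, conf in frame:
            if rng.random() < drop_prob:
                # Assumed encoding of a missed detection.
                new_frame.append((0.0, 0.0, 0.0))
            else:
                new_frame.append((x + rng.gauss(0.0, noise_std),
                                  y + rng.gauss(0.0, noise_std),
                                  conf))
        out.append(new_frame)
    return out


def accuracy(classify: Callable[[PoseSequence], str],
             labelled: Sequence[Tuple[str, PoseSequence]]) -> float:
    """Share of labelled sequences for which classify returns the label."""
    correct = sum(1 for label, seq in labelled if classify(seq) == label)
    return correct / len(labelled)


clip = [[(1.0, 2.0, 0.9), (3.0, 4.0, 0.8)]]
unchanged = perturb(clip, drop_prob=0.0, noise_std=0.0, rng=random.Random(0))
all_dropped = perturb(clip, drop_prob=1.0, noise_std=0.0, rng=random.Random(0))
print(unchanged == clip, all_dropped)
```

Sweeping `drop_prob` and `noise_std` over a grid, perturbing only the test sequences, and re-running `accuracy` with the DTW+1NN classifier would then give the degradation curve the referee asks for.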

Circularity Check

0 steps flagged

No circularity: pipeline uses external OpenPose and standard DTW+1NN on public data without self-referential fits or definitions.

The paper presents a straightforward pipeline: OpenPose (external) extracts 2D keypoints from RGB video, followed by DTW distance computation and 1NN classification. No equations, fitted parameters, or predictions are described that reduce reported accuracy to quantities defined by the authors' own prior choices. No self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming of known results occurs. The central claim of hardware independence and few-shot addition of gestures follows directly from the off-the-shelf components and the public dataset evaluation, remaining self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; no equations, datasets, or implementation details available to enumerate free parameters or invented entities.

axioms (1)

domain assumption: OpenPose produces sufficiently accurate and temporally consistent 2D keypoints on the input RGB videos.
The entire pipeline rests on this unstated premise about the pose estimator's output quality.

pith-pipeline@v0.9.0 · 5705 in / 1187 out tokens · 24175 ms · 2026-05-25T16:35:41.117294+00:00 · methodology

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 1 internal anchor

[1] C. Chen, R. Jafari, and N. Kehtarnavaz, “UTD-MHAD: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor,” in 2015 IEEE International Conference on Image Processing (ICIP). IEEE, 2015, pp. 168–172.
[2] N. Gillian, B. Knapp, and S. O’Modhrain, “Recognition of multivariate temporal musical gestures using n-dimensional dynamic time warping,” in NIME, 2011, pp. 337–342.
[3] A. Rosa-Pujazón, I. Barbancho, L. J. Tardón, and A. M. Barbancho, “Fast-gesture recognition and classification using Kinect: An application for a virtual reality drumkit,” Multimedia Tools and Applications, vol. 75, no. 14, pp. 8137–8164, 2016.
[4] F. Jiang, S. Zhang, S. Wu, Y. Gao, and D. Zhao, “Multi-layered gesture recognition with Kinect,” The Journal of Machine Learning Research, vol. 16, no. 1, pp. 227–254, 2015.
[5] A. Ribó, D. Warchol, and W. Oszust, “An approach to gesture recognition with skeletal data using dynamic time warping and nearest neighbour classifier,” International Journal of Intelligent Systems and Applications, vol. 8, no. 6, pp. 1–8, 2016.
[6] M. A. Bautista, A. Hernández-Vela, V. Ponce, X. Perez-Sala, X. Baró, O. Pujol, C. Angulo, and S. Escalera, “Probability-based dynamic time warping for gesture recognition on RGB-D data,” in International Workshop on Depth Image Analysis and Applications. Springer, 2012, pp. 126–135.
[7] S. Mitra and T. Acharya, “Gesture recognition: A survey,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 37, no. 3, pp. 311–324, 2007.
[8] N. Y. Y. Kevin, S. Ranganath, and D. Ghosh, “Trajectory modeling in gesture recognition using cybergloves® and magnetic trackers,” in 2004 IEEE Region 10 Conference TENCON 2004. IEEE, 2004, pp. 571–574.
[9] M. Reyes, G. Dominguez, and S. Escalera, “Feature weighting in dynamic time warping for gesture recognition in depth data,” in Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on. IEEE, 2011, pp. 1182–1188.
[10] G. A. Ten Holt, M. J. Reinders, and E. Hendriks, “Multi-dimensional dynamic time warping for gesture recognition,” in Thirteenth Annual Conference of the Advanced School for Computing and Imaging, vol. 300, 2007, p. 1.
[11] X. Xi, E. Keogh, C. Shelton, L. Wei, and C. A. Ratanamahatana, “Fast time series classification using numerosity reduction,” in Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006, pp. 1033–1040.
[12] A. Bagnall, J. Lines, A. Bostrom, J. Large, and E. Keogh, “The great time series classification bake off: A review and experimental evaluation of recent algorithmic advances,” Data Mining and Knowledge Discovery, vol. 31, no. 3, pp. 606–660, 2017.
[13] H. Sakoe and S. Chiba, “Dynamic programming algorithm optimization for spoken word recognition,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 26, no. 1, pp. 43–49, 1978.
[14] F. Itakura, “Minimum prediction residual principle applied to speech recognition,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 23, no. 1, pp. 67–72, 1975.
[15] C. A. Ratanamahatana and E. Keogh, “Making time-series classification more accurate using learned constraints,” in Proceedings of the 2004 SIAM International Conference on Data Mining. SIAM, 2004, pp. 11–22.
[16] S. Salvador and P. Chan, “Toward accurate dynamic time warping in linear time and space,” Intelligent Data Analysis, vol. 11, no. 5, pp. 561–580, 2007.
[17] M. Müller, Information retrieval for music and motion. Springer, 2007.
[18] P. Senin, “Dynamic time warping algorithm review,” Information and Computer Science Department University of Hawaii at Manoa Honolulu, USA, vol. 855, pp. 1–23, 2008.
[19] X. Chai, Z. Liu, F. Yin, Z. Liu, and X. Chen, “Two streams recurrent neural networks for large-scale continuous gesture recognition,” in 23rd International Conference on Pattern Recognition (ICPR). IEEE, 2016, pp. 31–36.
[20] P. Molchanov, X. Yang, S. Gupta, K. Kim, S. Tyree, and J. Kautz, “Online detection and classification of dynamic hand gestures with recurrent 3d convolutional neural network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4207–4215.
[21] H. Liu and L. Wang, “Gesture recognition for human-robot collaboration: A review,” International Journal of Industrial Ergonomics, vol. 68, pp. 355–367, 2018.
[22] R. Memmesheimer, I. Mykhalchyshyna, and D. Paulus, “Gesture recognition on human pose features of single images,” in Intelligent Systems (IS), 2018 9th International Conference on. IEEE, 2018, pp. 1–7.
[23] S. Celebi, A. S. Aydin, T. T. Temiz, and T. Arici, “Gesture recognition using skeleton data with weighted dynamic time warping,” in VISAPP (1), 2013, pp. 620–625.
[24] J. Rwigema, H.-R. Choi, and T. Kim, “A differential evolution approach to optimize weights of dynamic time warping for multi-sensor based gesture recognition,” Sensors (Basel, Switzerland), vol. 19, no. 5, p. 1007, 2019.
[25] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, “Convolutional pose machines,” in CVPR, 2016.
[26] T. Simon, H. Joo, I. Matthews, and Y. Sheikh, “Hand keypoint detection in single images using multiview bootstrapping,” in CVPR, 2017.
[27] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, “OpenPose: realtime multi-person 2D pose estimation using Part Affinity Fields,” in arXiv preprint arXiv:1812.08008, 2018.
[28] E. J. Keogh and M. J. Pazzani, “Derivative dynamic time warping,” in Proceedings of the 2001 SIAM International Conference on Data Mining. SIAM, 2001, pp. 1–11.
