Attention-Guided Dual-Stream Learning for Group Engagement Recognition: Fusing Transformer-Encoded Motion Dynamics with Scene Context via Adaptive Gating
Pith reviewed 2026-05-10 16:49 UTC · model grok-4.3
The pith
Dual-stream model fuses transformer-encoded motion dynamics with 3D scene context via adaptive gating to reach 96 percent accuracy on group engagement recognition
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DualEngage models engagement as a joint function of individual and group-level behaviors. It encodes person-level motion dynamics into a unified representation through student detection, tracking, dense optical flow from the Recurrent All-Pairs Field Transforms (RAFT) network, transformer temporal encoding, and attention pooling, then combines this with scene-level spatiotemporal features from a pretrained three-dimensional Residual Network (3D ResNet) via softmax-gated fusion that dynamically weights each stream based on the joint feature context.
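The review does not reproduce the paper's architecture details, but the temporal-encoding step can be sketched as follows. This is a hypothetical PyTorch fragment that assumes per-student flow-derived feature sequences are already produced upstream by detection, tracking, and RAFT; all dimensions and names are illustrative, not the authors'.

```python
import torch
import torch.nn as nn

class MotionTemporalEncoder(nn.Module):
    """Transformer temporal encoding of per-student motion features.

    Input: a (num_students, T, feat_dim) tensor of flow-derived
    features per tracked student; output: one vector per student.
    All dimensions here are assumptions, not the paper's.
    """
    def __init__(self, feat_dim=256, n_heads=4, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, flow_feats):
        h = self.encoder(flow_feats)   # contextualize each time step
        return h.mean(dim=1)           # temporal average per student

# Toy usage: 6 students, 16 time steps, 256-dim flow features.
enc = MotionTemporalEncoder()
student_vecs = enc(torch.randn(6, 16, 256))   # -> (6, 256)
```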
What carries the argument
The softmax-gated fusion mechanism that dynamically weights the contribution of the transformer-encoded and attention-pooled motion stream against the 3D ResNet scene stream according to their joint feature context.
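A minimal sketch of such a gate in PyTorch, under assumed layer sizes (the paper's exact dimensions and parameterization are not given in the review):

```python
import torch
import torch.nn as nn

class SoftmaxGatedFusion(nn.Module):
    """Fuse motion-stream and scene-stream features with weights
    computed by a softmax over their concatenated (joint) context.
    A sketch of the mechanism described above, not the authors' code.
    """
    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 2)   # joint context -> 2 logits

    def forward(self, f_motion, f_scene):   # each (batch, dim)
        joint = torch.cat([f_motion, f_scene], dim=-1)
        w = torch.softmax(self.gate(joint), dim=-1)   # (batch, 2)
        return w[:, :1] * f_motion + w[:, 1:] * f_scene

# Toy usage: fused = SoftmaxGatedFusion()(torch.randn(4, 512),
#                                         torch.randn(4, 512))
```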
If this is right
- The dual-stream architecture improves classification accuracy and macro F1 over single-stream baselines as demonstrated by the ablation experiments.
- Individual motion dynamics captured by optical flow and transformer encoding supply information complementary to full-scene spatiotemporal features.
- Attention pooling successfully converts per-student motion encodings into a single group-level representation (a minimal sketch follows this list).
- Modeling engagement as the combination of person-level and scene-level signals enables accurate recognition from in-classroom video without relying on online-only or individual-only assumptions.
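A minimal sketch of attention pooling over per-student encodings, assuming a simple scoring MLP; the authors' exact pooling layer is not specified in the review.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Pool a variable number of per-student vectors into one
    group-level vector via learned attention weights. A sketch in
    the spirit of attention-based pooling, not the paper's layer.
    """
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(),
                                   nn.Linear(64, 1))

    def forward(self, student_vecs):          # (n_students, dim)
        a = torch.softmax(self.score(student_vecs), dim=0)  # (n, 1)
        return (a * student_vecs).sum(dim=0)  # (dim,) group vector

# Toy usage: 7 students pooled into one 256-dim group vector.
group_vec = AttentionPooling()(torch.randn(7, 256))
```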
Where Pith is reading between the lines
- Automated tools built on this architecture could supply teachers with real-time indicators of how well group activities are sustaining collective attention.
- The same motion-plus-scene fusion pattern could be tested on video from other group settings such as team meetings or laboratory collaborations.
- Adding audio or speech features to the existing visual streams would constitute a direct next step for richer joint representations.
- Cross-dataset experiments on classrooms varying in age, culture, or room layout would reveal the extent to which the learned gating and motion features transfer.
Load-bearing premise
The Classroom Group Engagement Dataset supplies representative and accurately labeled examples of group engagement levels that generalize beyond the specific classrooms and students recorded.
What would settle it
Retraining DualEngage on the Ocean University dataset and evaluating it on an independent collection of classroom videos from a different school or country; a large drop to below 90 percent accuracy would indicate that the performance is tied to dataset-specific patterns rather than general group engagement cues.
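As a sketch of that settle-it criterion, assuming hypothetical label arrays from an external dataset (the function name, threshold handling, and toy labels below are illustrative, not artifacts from the paper):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def cross_dataset_check(y_true, y_pred, threshold=0.90):
    """Score predictions on an independent classroom collection and
    flag a large drop, per the criterion stated above."""
    acc = accuracy_score(y_true, y_pred)
    macro_f1 = f1_score(y_true, y_pred, average="macro")
    tied_to_source = acc < threshold   # drop below 0.90 -> suspect
    return acc, macro_f1, tied_to_source

# Toy usage with made-up engagement labels (levels 0-2).
y_true = np.array([0, 1, 2, 2, 1, 0, 2, 1])
y_pred = np.array([0, 1, 2, 1, 1, 0, 2, 2])
print(cross_dataset_check(y_true, y_pred))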
Original abstract
Student engagement is crucial for improving learning outcomes in group activities. Highly engaged students perform better individually and contribute to overall group success. However, most existing automated engagement recognition methods are designed for online classrooms or estimate engagement at the individual level. Addressing this gap, we propose DualEngage, a novel two-stream framework for group-level engagement recognition from in-classroom videos. It models engagement as a joint function of both individual and group-level behaviors. The primary stream models person-level motion dynamics by detecting and tracking students, extracting dense optical flow with the Recurrent All-Pairs Field Transforms network, encoding temporal motion patterns using a transformer encoder, and finally aggregating per-student representations into a unified group representation through attention pooling. The secondary stream captures scene-level spatiotemporal information from the full video clip, leveraging a pretrained three-dimensional Residual Network. The two-stream representations are combined via softmax-gated fusion, which dynamically weights each stream's contribution based on the joint context of both features. DualEngage thus learns a joint representation of individual actions and overarching group dynamics. We evaluate the proposed approach using fivefold cross-validation on the Classroom Group Engagement Dataset developed by Ocean University of China, achieving an average classification accuracy of 0.9621±0.0161 with a macro-averaged F1 of 0.9530±0.0204. To understand the contribution of each branch, we further conduct an ablation study comparing single-stream variants against the two-stream model. This work is among the first in classroom engagement recognition to adopt a dual-stream design that explicitly leverages motion cues as an engagement estimator.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes DualEngage, a dual-stream architecture for group-level student engagement recognition in classroom videos. The primary stream detects and tracks individuals, computes dense optical flow via RAFT, encodes temporal dynamics with a transformer, and aggregates via attention pooling. The secondary stream extracts spatiotemporal scene features with a pretrained 3D ResNet. Representations are fused by softmax-gated adaptive weighting. On the Classroom Group Engagement Dataset from Ocean University of China, five-fold cross-validation yields 0.9621±0.0161 accuracy and 0.9530±0.0204 macro F1; ablations confirm the dual-stream benefit over single-stream variants.
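Because the minor comments below question whether folds are split to avoid leakage, a common precaution is group-aware splitting, so that clips from the same video never straddle train and test. A minimal scikit-learn sketch with hypothetical arrays (nothing here comes from the paper's protocol):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.zeros((20, 8))                   # 20 clips, toy features
y = np.random.randint(0, 3, size=20)    # toy engagement labels
video_ids = np.repeat(np.arange(5), 4)  # 4 clips per source video

# Five folds, grouped by video: no video appears in both splits.
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=video_ids):
    assert set(video_ids[train_idx]).isdisjoint(video_ids[test_idx])
```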
Significance. If the performance holds under proper validation, the work is significant as one of the first explicit dual-stream treatments of group engagement that jointly models individual motion dynamics and scene context. The transformer-based motion encoding and learned gating mechanism constitute a clear technical contribution, and the internal ablation study supplies direct evidence that both streams are necessary. The approach addresses a genuine gap between individual/online engagement estimators and in-classroom group analysis.
major comments (2)
- [Experiments] The manuscript reports 0.9621 accuracy and 0.9530 macro F1 on the Classroom Group Engagement Dataset but supplies no information on dataset size, number of clips, class balance, labeling protocol, inter-rater agreement, or student demographics. Without these details the central performance claim cannot be assessed for reliability or generalization beyond the single institutional source.
- [Experimental results] While ablation tables compare single-stream variants to the full model, no quantitative comparison is provided against prior group- or individual-level engagement methods. This omission makes it impossible to determine whether the reported numbers represent an advance over existing baselines.
minor comments (2)
- [Abstract and Methods] The abstract and introduction refer to 'fivefold cross-validation' without stating whether folds are stratified by class or by video to avoid leakage; this should be clarified in the methods.
- [Method] Notation for the gated fusion (softmax over the two stream features) is introduced without an explicit equation; adding a numbered equation, one plausible form of which is sketched below, would improve reproducibility.
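One plausible numbered form of the gating, consistent with the description above, where \(\mathbf{f}_m\) and \(\mathbf{f}_s\) denote the motion- and scene-stream features and \(W, \mathbf{b}\) the learned gating parameters; the paper's exact notation is not given in the review:

```latex
\begin{equation}
  (g_m,\, g_s) = \operatorname{softmax}\!\left( W\,[\mathbf{f}_m ; \mathbf{f}_s] + \mathbf{b} \right),
  \qquad
  \mathbf{f}_{\mathrm{fused}} = g_m\,\mathbf{f}_m + g_s\,\mathbf{f}_s .
\end{equation}
```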
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that strengthen the manuscript without misrepresenting our contributions or results.
Point-by-point responses
- Referee: [Experiments] The manuscript reports 0.9621 accuracy and 0.9530 macro F1 on the Classroom Group Engagement Dataset but supplies no information on dataset size, number of clips, class balance, labeling protocol, inter-rater agreement, or student demographics. Without these details the central performance claim cannot be assessed for reliability or generalization beyond the single institutional source.
Authors: We agree that these dataset details are essential for evaluating reliability and generalization. The Classroom Group Engagement Dataset was developed at Ocean University of China, and we have the underlying statistics available. In the revised manuscript we will add a dedicated subsection in Experiments describing: total number of clips and students, class balance across engagement levels, the labeling protocol (including number of annotators and how consensus was reached), student demographics, and any inter-rater agreement statistics that were computed. If formal inter-rater agreement was not recorded during dataset creation, we will explicitly note this as a limitation and discuss its implications. revision: yes
- Referee: [Experimental results] While ablation tables compare single-stream variants to the full model, no quantitative comparison is provided against prior group- or individual-level engagement methods. This omission makes it impossible to determine whether the reported numbers represent an advance over existing baselines.
Authors: We acknowledge that external baselines are needed to contextualize the reported performance. While the primary contribution is the dual-stream architecture and the ablation evidence for both streams, we will add quantitative comparisons in the revised Experiments section. We will implement and evaluate representative prior individual-engagement methods (e.g., CNN- or LSTM-based student behavior models from the literature) on our group-level dataset, reporting accuracy and macro F1. We will also include any directly applicable group-engagement baselines if they exist. We will clearly discuss differences in task formulation (individual vs. group) and dataset characteristics so readers can interpret the gains fairly. revision: yes
Circularity Check
No significant circularity; empirical evaluation is self-contained
Full rationale
The paper proposes a dual-stream neural architecture (transformer on optical flow + 3D ResNet + gated fusion) and reports its classification performance via five-fold cross-validation on a newly introduced dataset. No equations, first-principles derivations, or uniqueness theorems are claimed; the central result is an empirical accuracy number obtained by training and testing on held-out folds of the same data distribution. This is standard supervised learning practice and does not reduce the reported metric to a fitted parameter by construction or to a self-citation chain. No load-bearing self-citations or ansatzes are invoked in the provided text.
Axiom & Free-Parameter Ledger
free parameters (1)
- learned gating parameters
axioms (2)
- domain assumption: video labels accurately reflect true group engagement levels
- domain assumption: optical flow and 3D ResNet features capture behaviorally relevant information
invented entities (1)
- DualEngage dual-stream framework (no independent evidence)