Variational Encoder-Multi-Decoder (VE-MD) for Privacy-by-Functional-Design (Group) Emotion Recognition
Pith reviewed 2026-05-13 21:49 UTC · model grok-4.3
The pith
The VE-MD framework recognizes group emotions at state-of-the-art levels while outputting only aggregate affect, drawing on a shared latent representation regularized by structural decoding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VE-MD constrains outputs to aggregate group-level affect by training a variational encoder whose latent space supports decoding of variable-size body and face structures. Explicit structural supervision preserves the interaction-related information needed for collective inference; optimizing the latent space alone attenuates those cues in group emotion recognition, whereas in individual emotion recognition the projected structural representations act instead as a denoising bottleneck.
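To make the contrast concrete, here is a minimal form of the joint objective implied above; the notation and the weights λ and β are assumptions for illustration, not taken from the paper. Setting λ = 0 recovers latent-only optimization, while λ > 0 adds the structural supervision that the claim says group-level inference needs.

```latex
\mathcal{L} =
\underbrace{\mathcal{L}_{\mathrm{cls}}\big(y_{\mathrm{group}}, \hat{y}_{\mathrm{group}}\big)}_{\text{aggregate affect}}
+ \lambda \sum_{d \in \{\mathrm{body},\,\mathrm{face}\}} \mathcal{L}_{\mathrm{struct}}^{(d)}\big(S_d, \hat{S}_d\big)
+ \beta\, D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
```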
What carries the argument
A Variational Encoder-Multi-Decoder (VE-MD) architecture whose decoders (a transformer-based PersonQuery decoder or a dense Heatmap decoder) predict body and facial structural representations, regularizing the shared latent space used for group affect classification.
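A minimal PyTorch sketch of that layout, assuming features from a frozen visual backbone; module names, dimensions, and the heatmap head are illustrative guesses, not the authors' implementation. The PersonQuery variant (a DETR-style set of learned person queries) is omitted for brevity.

```python
# Hypothetical sketch of the VE-MD layout: variational encoder, a
# group-level classification head, and a dense Heatmap decoder that
# regularizes the latent space. Sizes are placeholders.
import torch
import torch.nn as nn

class VEMDSketch(nn.Module):
    def __init__(self, in_ch=256, latent_dim=128, num_classes=3, num_joints=17):
        super().__init__()
        # Variational encoder: spatial features -> latent Gaussian parameters.
        self.enc = nn.Sequential(
            nn.Conv2d(in_ch, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.mu = nn.Linear(128, latent_dim)
        self.logvar = nn.Linear(128, latent_dim)
        # Group-level head: the only output exposed at inference.
        self.cls = nn.Linear(latent_dim, num_classes)
        # Dense Heatmap decoder: per-joint heatmaps accommodate variable
        # group sizes, since any number of people can activate the maps.
        self.heatmap_dec = nn.Sequential(
            nn.Linear(latent_dim, 128 * 8 * 8), nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, num_joints, 4, stride=2, padding=1),
        )

    def forward(self, feats):
        h = self.enc(feats)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return self.cls(z), self.heatmap_dec(z), mu, logvar
```

At inference only the classification logits would be exposed; the heatmap branch exists to shape the latent space during training.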
If this is right
- Structural supervision consistently raises performance on group emotion datasets by retaining interaction cues.
- The same structural outputs improve individual emotion recognition by acting as a denoising bottleneck.
- Multimodal fusion with audio reaches 82.25 percent on VGAF, and adding the text modality pushes SamSemo to 77.9 percent, outperforming prior work.
- Group-level tasks require explicit structural preservation, unlike individual tasks where latent optimization suffices.
Where Pith is reading between the lines
- The same encoder-decoder split could be applied to other collective behaviors such as group attention or crowd dynamics without adding per-person outputs.
- Deployment in regulated environments would still need separate verification that latent features cannot be inverted to recover identities.
- Replacing the structural decoders with purely latent constraints would likely reduce accuracy on group datasets while preserving the privacy property.
- The observed GER versus IER performance gap suggests that collective affect modeling benefits from mechanisms that explicitly encode pairwise or spatial relations.
Load-bearing premise
Avoiding explicit per-person emotion or identity outputs is enough to deliver meaningful privacy protection when the model is deployed on real group data.
What would settle it
An adversary successfully reconstructing individual identities or per-person emotions from the latent representations or decoder outputs on a held-out test set would falsify the privacy claim.
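In outline, such a test could look like the sketch below; `encoder`, `inverter`, and `identify` are hypothetical stand-ins under assumed interfaces, since the paper reports no such experiment.

```python
# Sketch of an inversion-attack evaluation against a frozen encoder: fit a
# decoder from latents back to images, then ask a face-identification model
# whether reconstructions reveal enrolled identities above chance.
import torch
import torch.nn.functional as F

def train_inverter(encoder, inverter, loader, epochs=10, lr=1e-4):
    """Fit a decoder mapping frozen latents back to input images."""
    opt = torch.optim.Adam(inverter.parameters(), lr=lr)
    encoder.eval()
    for _ in range(epochs):
        for images, _ in loader:
            with torch.no_grad():
                z = encoder(images)  # the attacker only observes latents
            loss = F.mse_loss(inverter(z), images)
            opt.zero_grad(); loss.backward(); opt.step()
    return inverter

@torch.no_grad()
def identification_rate(encoder, inverter, test_loader, identify):
    """Fraction of held-out subjects a face-ID model recovers from
    reconstructions; a rate well above chance would falsify the claim."""
    hits, total = 0, 0
    for images, identities in test_loader:
        recon = inverter(encoder(images))
        hits += (identify(recon) == identities).sum().item()
        total += identities.numel()
    return hits / total
```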
original abstract
Group Emotion Recognition (GER) aims to infer collective affect in social environments such as classrooms, crowds, and public events. Many existing approaches rely on explicit individual-level processing, including cropped faces, person tracking, or per-person feature extraction, which makes the analysis pipeline person-centric and raises privacy concerns in deployment scenarios where only group-level understanding is needed. This research proposes VE-MD, a Variational Encoder-Multi-Decoder framework for group emotion recognition under a privacy-aware functional design. Rather than providing formal anonymization or cryptographic privacy guarantees, VE-MD is designed to avoid explicit individual monitoring by constraining the model to predict only aggregate group-level affect, without identity recognition or per-person emotion outputs. VE-MD learns a shared latent representation jointly optimized for emotion classification and internal prediction of body and facial structural representations. Two structural decoding strategies are investigated: a transformer-based PersonQuery decoder and a dense Heatmap decoder that naturally accommodates variable group sizes. Experiments on six in-the-wild datasets, including two GER and four Individual Emotion Recognition (IER) benchmarks, show that structural supervision consistently improves representation learning. More importantly, the results reveal a clear distinction between GER and IER: optimizing the latent space alone is often insufficient for GER because it tends to attenuate interaction-related cues, whereas preserving explicit structural outputs improves collective affect inference. In contrast, projected structural representations seem to act as an effective denoising bottleneck for IER. VE-MD achieves state-of-the-art performance on GAF-3.0 (up to 90.06%) and VGAF (82.25% with multimodal fusion with audio). These results show that preserving interaction-related structural information is particularly beneficial for group-level affect modeling without relying on prior individual feature extraction. On IER datasets using multimodal fusion with audio modality, VE-MD outperforms SOTA on SamSemo (77.9%, adding text modality) while achieving competitive performances on MER-MULTI (63.8%), DFEW (70.7%) and EngageNet (69.0%).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes VE-MD, a Variational Encoder-Multi-Decoder framework for group emotion recognition (GER) under a privacy-by-functional-design approach. It learns a shared latent representation jointly optimized for group-level affect classification and structural decoding of faces/bodies (via PersonQuery transformer or Heatmap decoder), explicitly avoiding per-person emotion or identity outputs. Experiments across six in-the-wild datasets report SOTA results on GER benchmarks (90.06% on GAF-3.0; 82.25% on VGAF with audio fusion) and competitive performance on individual emotion recognition (IER) tasks, arguing that explicit structural outputs preserve interaction cues beneficial for GER but act as a denoising bottleneck for IER.
Significance. If the empirical results and the functional privacy premise hold under scrutiny, the work would provide a practical route to privacy-aware GER in sensitive settings (classrooms, crowds) without relying on explicit individual processing or cryptographic primitives. The reported distinction between GER and IER regarding the utility of structural supervision is a potentially useful insight for representation learning in affective computing.
major comments (2)
- [Abstract] The privacy-by-functional-design claim is load-bearing for the paper's contribution yet rests on the unverified premise that group-level outputs plus structural decoders (PersonQuery/Heatmap) cannot leak individual identity or per-person affect; no reconstruction metrics, membership-inference results, or inversion attacks on the latent space are reported despite the explicit disclaimer of formal guarantees.
- [Abstract / Experimental section] SOTA numbers (90.06% on GAF-3.0, 82.25% on VGAF) are stated without reference to the experimental protocol, baseline implementations, ablation controls, or statistical significance tests, preventing verification that gains are attributable to the claimed VE-MD mechanisms rather than dataset-specific factors or multimodal fusion (a minimal form of such a check is sketched after this list).
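A minimal version of the significance check the second comment asks for, sketched under assumptions: the per-seed accuracies are placeholders, and a paired test over matched seeds is one reasonable choice, not the paper's protocol.

```python
# Hypothetical paired comparison of VE-MD against a no-structural-supervision
# ablation across training seeds; the accuracy lists are placeholders.
from scipy import stats

vemd_acc     = [90.1, 89.7, 90.3, 89.9, 90.2]  # per-seed accuracy (illustrative)
ablation_acc = [88.4, 88.9, 88.1, 88.6, 88.3]

t, p = stats.ttest_rel(vemd_acc, ablation_acc)  # paired t-test over seeds
print(f"paired t = {t:.2f}, p = {p:.4f}")
```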
minor comments (1)
- The explanation of why latent-only optimization attenuates interaction cues in GER while projected structural representations denoise IER could be strengthened with a brief comparison to prior interaction-modeling literature.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below with clarifications and proposed revisions to improve the manuscript.
point-by-point responses
-
Referee: [Abstract] The privacy-by-functional-design claim is load-bearing for the paper's contribution yet rests on the unverified premise that group-level outputs plus structural decoders (PersonQuery/Heatmap) cannot leak individual identity or per-person affect; no reconstruction metrics, membership-inference results, or inversion attacks on the latent space are reported despite the explicit disclaimer of formal guarantees.
Authors: We agree that the functional privacy claim requires careful framing. The manuscript already states that VE-MD provides no formal anonymization or cryptographic guarantees and is instead designed to avoid explicit individual monitoring by producing only group-level outputs. We will revise the abstract and add a dedicated paragraph in the introduction to explicitly distinguish functional design from formal privacy methods. We will also expand the limitations section to acknowledge the absence of reconstruction, membership-inference, or inversion attack evaluations. This constitutes a partial revision focused on clarification rather than new empirical privacy experiments. revision: partial
-
Referee: [Abstract / Experimental section] SOTA numbers (90.06% on GAF-3.0, 82.25% on VGAF) are stated without reference to the experimental protocol, baseline implementations, ablation controls, or statistical significance tests, preventing verification that gains are attributable to the claimed VE-MD mechanisms rather than dataset-specific factors or multimodal fusion.
Authors: The experimental section (Section 4) already details the evaluation protocols, baseline re-implementations, ablation studies on the PersonQuery and Heatmap decoders, and reports results with standard deviations across multiple runs. To address the concern, we will revise the abstract to include concise references to the key experimental settings and cross-reference the relevant subsections. We will also ensure all performance claims are accompanied by explicit baseline comparisons in the main text. This is a straightforward revision to improve readability and verifiability. revision: yes
Circularity Check
No significant circularity: empirical results on external benchmarks with no derivations or self-referential fitting
full rationale
The manuscript contains no equations, derivations, or parameter-fitting steps that could reduce to self-definition. Performance claims (e.g., 90.06% on GAF-3.0) rest on direct evaluation against independent public datasets. The privacy-by-functional-design argument is an architectural choice (group-level outputs only, no per-person labels) rather than a derived result; it is not obtained by fitting or by self-citation chains. No load-bearing self-citations, uniqueness theorems, or ansatzes appear in the provided text. This is the normal case of a purely empirical architecture paper whose central claims are externally falsifiable.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Standard variational inference and reconstruction objectives for encoder-decoder networks (the canonical bound is restated below)
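For reference, a minimal statement of that standard objective, the evidence lower bound, with notation assumed rather than taken from the paper:

```latex
\log p_\theta(x) \;\geq\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
```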
Reference graph
Works this paper leans on
- [1] Ali Abedi and Shehroz S. Khan. Engagement measurement based on facial landmarks and spatial-temporal graph convolutional networks. arXiv e-prints, 2024.
- [2] Jan Philipp Albrecht. How the GDPR will change the world. European Data Protection Law Review, 2:287, 2016.
- [3] Avinash Anand, Avni Mittal, Laavanaya Dhawan, Juhi Krishnamurthy, Mahisha Ramesh, Naman Lal, Astha Verma, Pijush Bhuyan, Raijv Ratn Shah, Roger Zimmermann, et al. ExCEDA: Unlocking attention paradigms in extended duration e-classrooms by leveraging attention-mechanism models. In 2024 IEEE 7th International Conference on Multimedia Information Processi..., 2024.
- [4] Anderson Augusma, Dominique Vaufreydaz, and Frédérique Letué. Multimodal group emotion recognition in-the-wild using privacy-compliant features. In Proceedings of the 25th International Conference on Multimodal Interaction, pages 750–754, 2023.
- [5] Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460, 2020.
- [6] Paweł Bujnowski, Bartłomiej Kuźma, Bartłomiej Paziewski, Jacek Rutkowski, Joanna Marhula, Zuzanna Bordzicka, and Piotr Andruszkiewicz. SamSemo: New dataset for multilingual and multimodal emotion recognition. In Interspeech, 2024.
- [7] Adrian Bulat and Georgios Tzimiropoulos. How far are we from solving the 2D & 3D face alignment problem? (And a dataset of 230,000 3D facial landmarks). In Proceedings of the IEEE International Conference on Computer Vision, pages 1021–1030, 2017.
- [8] Manuel G. Calvo, Andrés Fernández-Martín, Guillermo Recio, and Daniel Lundqvist. Human observers and automated assessment of dynamic emotional facial expressions: KDEF-dyn database validation. Frontiers in Psychology, 9:2052, 2018.
- [9] Celso Cancela-Outeda. The EU's AI Act: A framework for collaborative governance. Internet of Things, 27:101291, 2024.
- [10] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7291–7299, 2017.
- [11] Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(1):172–186, 2019.
- [12] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020.
- [13] Bertrand Cassar. Règlement sur l'intelligence artificielle, version enrichie. 2024.
- [14] Haodong Chen, Haojian Huang, Junhao Dong, Mingzhe Zheng, and Dian Shao. FineCLIPER: Multi-modal fine-grained CLIP for dynamic facial expression recognition with adapters. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 2301–2310, 2024.
- [15] Xiaojiao Chen, Guangxing Li, Hao Huang, Wangjin Zhou, Sheng Li, Yang Cao, and Yi Zhao. System description for Voice Privacy Challenge 2022. In Proc. 2nd Symposium on Security and Privacy in Speech Communication, 2022.
- [16] Abhinav Dhall, Garima Sharma, Roland Goecke, and Tom Gedeon. EmotiW 2020: Driver gaze, group emotion, student engagement and physiological signal based challenges. In Proceedings of the 2020 International Conference on Multimodal Interaction, pages 784–789, 2020.
- [17] Abhinav Dhall, Monisha Singh, Roland Goecke, Tom Gedeon, Donghuo Zeng, Yanan Wang, and Kazushi Ikeda. EmotiW 2023: Emotion recognition in the wild challenge. In Proceedings of the 25th International Conference on Multimodal Interaction (ICMI 2023), 2023.
- [18] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
- [19] Gintare Karolina Dziugaite, Daniel M. Roy, and Zoubin Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pages 258–267, 2015.
- [20] Natalie C. Ebner, Michaela Riediger, and Ulman Lindenberger. FACES: A database of facial expressions in young, middle-aged, and older women and men: Development and validation. Behavior Research Methods, 42:351–362, 2010. doi:10.3758/BRM.42.1.351.
- [21] Pasquale Foggia, Antonio Greco, Alessia Saggese, and Mario Vento. Multi-task learning on the edge for effective gender, age, ethnicity and emotion recognition. Engineering Applications of Artificial Intelligence, 118:105651, 2023.
- [22] Niki Maria Foteinopoulou and Ioannis Patras. EmoCLIP: A vision-language method for zero-shot video facial expression recognition. In 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG), pages 1–10. IEEE, 2024.
- [23] Xin Guo, Luisa Polania, Bin Zhu, Charles Boncelet, and Kenneth Barner. Graph neural networks for image understanding based on multiple cues: Group emotion recognition and event recognition as use cases. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2921–2930, 2020.
- [24] Aarush Gupta, Dakshit Agrawal, Hardik Chauhan, Jose Dolz, and Marco Pedersoli. An attention model for group-level emotion recognition. In Proceedings of the 20th ACM International Conference on Multimodal Interaction, pages 611–615, 2018.
- [25] Chaoqun Hong, Jun Yu, Jian Zhang, Xiongnan Jin, and Kyong-Ho Lee. Multimodal face-pose estimation with multitask manifold deep learning. IEEE Transactions on Industrial Informatics, 15(7):3952–3961, 2018.
- [26] Guosheng Hu, Li Liu, Yang Yuan, Zehao Yu, Yang Hua, Zhihong Zhang, Fumin Shen, Ling Shao, Timothy Hospedales, Neil Robertson, et al. Deep multi-task learning to recognise subtle facial expressions of mental states. In Proceedings of the European Conference on Computer Vision (ECCV), pages 103–119, 2018.
- [27] Wenti Huang, Jun Long, et al. PSMF: Prototype network subgraph with multi-head attention framework for group emotion recognition. Under review, 121969, 2025.
- [28] Xingxun Jiang, Yuan Zong, Wenming Zheng, Chuangao Tang, Wanchuang Xia, Cheng Lu, and Jiateng Liu. DFEW: A large-scale database for recognizing dynamic facial expressions in the wild. In Proceedings of the 28th ACM International Conference on Multimedia, pages 2881–2889, 2020.
- [29] BeomJun Jo and SeongKi Kim. Comparative analysis of OpenPose, PoseNet, and MoveNet models for pose estimation in mobile devices. Traitement du Signal, 39(1):119, 2022.
- [30] Heechul Jung, Sihaeng Lee, Junho Yim, Sunjeong Park, and Junmo Kim. Joint fine-tuning in deep neural networks for facial expression recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 2983–2991, 2015.
- [31] Ashwani Kumar, Zuopeng Justin Zhang, and Hongbo Lyu. Object detection in real time based on improved single shot multi-box detector algorithm. EURASIP Journal on Wireless Communications and Networking, 2020(1):204, 2020.
- [32] Deepak Kumar, Piyush Dhamdhere, and Balasubramanian Raman. Fusing multimodal streams for improved group emotion recognition in videos. In International Conference on Pattern Recognition, pages 403–418. Springer, 2025.
- [33] Sotheara Leang, Anderson Augusma, Eric Castelli, Frédérique Letué, Sethserey Sam, and Dominique Vaufreydaz. Exploring VQ-VAE with prosody parameters for speaker anonymization. arXiv preprint arXiv:2409.15882, 2024.
- [34] Zheng Lian, Haiyang Sun, Licai Sun, Kang Chen, Mingyu Xu, Kexin Wang, Ke Xu, Yu He, Ying Li, Jinming Zhao, et al. MER 2023: Multi-label learning, modality robustness, and semi-supervised learning. In Proceedings of the 31st ACM International Conference on Multimedia, pages 9610–9614, 2023.
- [35] Chuanhe Liu, Wenqiang Jiang, Minghao Wang, and Tianhao Tang. Group level audio-video emotion recognition using hybrid networks. In Proceedings of the 2020 International Conference on Multimodal Interaction (ICMI 2020), pages 807–812. ACM, 2020. doi:10.1145/3382507.3417968.
- [36] Xinji Mai, Junxiong Lin, Haoran Wang, Zeng Tao, Yan Wang, Shaoqi Yan, Xuan Tong, Jiawen Yu, Boyang Wang, Ziheng Zhou, et al. All rivers run into the sea: Unified modality brain-inspired emotional central mechanism. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 632–641, 2024.
- [37] Xinji Mai, Haoran Wang, Zeng Tao, Junxiong Lin, Shaoqi Yan, Yan Wang, Jing Liu, Jiawen Yu, Xuan Tong, Yating Li, et al. OUS: Scene-guided dynamic facial expression recognition. CoRR, 2024.
- [38] Quang Tran Ngoc, Seunghyun Lee, and Byung Cheol Song. Facial landmark-based emotion recognition via directed graph neural network. Electronics, 9(5):764, 2020.
- [39] Koji Okabe, Takafumi Koshinaka, and Koichi Shinoda. Attentive statistics pooling for deep speaker embedding. arXiv preprint arXiv:1803.10963, 2018.
- [40] D. Osokin. Real-time 2D multi-person pose estimation on CPU: Lightweight OpenPose. In ICPRAM 2019, Proceedings of the 8th International Conference on Pattern Recognition Applications and Methods, pages 744–748, 2019.
- [41] Sikha Pentyala, Rafael Dowsley, and Martine De Cock. Privacy-preserving video classification with convolutional neural networks. In International Conference on Machine Learning, pages 8487–8499. PMLR, 2021.
- [42] Gerard Pons and David Masip. Multitask, multilabel, and multidomain learning with convolutional networks for emotion recognition. IEEE Transactions on Cybernetics, 52(6):4764–4771, 2020.
- [43] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, pages 28492–28518. PMLR, 2023.
- [44] Rajeev Ranjan, Vishal M. Patel, and Rama Chellappa. HyperFace: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(1):121–135, 2017.
- [45] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
- [46] Andrey V. Savchenko and I.A. Makarov. Neural network model for video-based analysis of student's emotions in e-learning. Optical Memory and Neural Networks, 31(3):237–244, 2022.
- [47] Garima Sharma, Abhinav Dhall, and Jianfei Cai. Audio-visual automatic group affect analysis. IEEE Transactions on Affective Computing, 2021.
- [48] Dahu Shi, Xing Wei, Liangqi Li, Ye Ren, and Wenming Tan. End-to-end multi-person pose estimation with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11069–11078, 2022.
- [49] Monisha Singh, Ximi Hoque, Donghuo Zeng, Yanan Wang, Kazushi Ikeda, and Abhinav Dhall. Do I have your attention: A large scale engagement prediction dataset and baselines. In Proceedings of the 25th International Conference on Multimodal Interaction, pages 174–182, 2023.
- [50] Mo Sun, Jian Li, Hui Feng, Wei Gou, Haifeng Shen, Jian Tang, Yi Yang, and Jieping Ye. Multi-modal fusion using spatio-temporal and static features for group emotion recognition. In Proceedings of the 2020 International Conference on Multimodal Interaction (ICMI 2020), pages 835–840, 2020. doi:10.1145/3382507.3417971.
- [51] Lianzhi Tan, Kaipeng Zhang, Kai Wang, Xiaoxing Zeng, Xiaojiang Peng, and Yu Qiao. Group emotion recognition with individual facial emotion CNNs and global image based CNNs. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, pages 549–552, 2017.
- [52] Zeng Tao, Yan Wang, Junxiong Lin, Haoran Wang, Xinji Mai, Jiawen Yu, Xuan Tong, Ziheng Zhou, Shaoqi Yan, Qing Zhao, et al. Align-DFER: Pioneering comprehensive dynamic affective alignment for dynamic facial expression recognition with CLIP. CoRR, 2024.
- [53] Alexander Vedernikov, Puneet Kumar, Haoyu Chen, Tapio Seppänen, and Xiaobai Li. TCCT-Net: Two-stream network architecture for fast and efficient engagement estimation via behavioral feature signals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 4723–4732, 2024.
- [54] Alessandro Vinciarelli, Maja Pantic, and Hervé Bourlard. Social signal processing: Survey of an emerging domain. Image and Vision Computing, 27(12):1743–1759, 2009.
- [55] Haotian Wang, Yuxuan Xi, Hang Chen, Jun Du, Yan Song, Qing Wang, Hengshun Zhou, Chenxi Wang, Jiefeng Ma, Pengfei Hu, et al. Hierarchical audio-visual information fusion with multi-label joint decoding for MER 2023. In Proceedings of the 31st ACM International Conference on Multimedia, pages 9531–9535, 2023.
- [56] Haotian Wang, Jun Du, Yusheng Dai, Chin-Hui Lee, Yuling Ren, and Yu Liu. Improving multi-modal emotion recognition using entropy-based fusion and pruning-based network architecture optimization. In ICASSP 2024, pages 11766–11770. IEEE, 2024.
- [57] Kai Wang, Xiaoxing Zeng, Jianfei Yang, Debin Meng, Kaipeng Zhang, Xiaojiang Peng, and Yu Qiao. Cascade attention networks for group emotion recognition with face, body and image cues. In Proceedings of the 20th ACM International Conference on Multimodal Interaction, pages 640–645, 2018.
- [58] Linhuang Wang, Xin Kang, Fei Ding, Satoshi Nakagawa, and Fuji Ren. A joint local spatial and global temporal CNN-Transformer for dynamic facial expression recognition. Applied Soft Computing, 161:111680, 2024.
- [59] Quanyu Wang, Kaixiang Zhang, and Manjotho Ali Asghar. Skeleton-based ST-GCN for human action recognition with extended skeleton graph and partitioning strategy. IEEE Access, 10:41403–41410, 2022.
- [60] Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao. ViTPose: Simple vision transformer baselines for human pose estimation. Advances in Neural Information Processing Systems, 35:38571–38584, 2022.
- [61] Jingwei Yan, Wenming Zheng, Zhen Cui, Chuangao Tang, Tong Zhang, Yuan Zong, and Ning Sun. Multi-clue fusion for emotion recognition in the wild. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, pages 458–463, 2016.
- [62] Jixun Yao, Nikita Kuzmin, Qing Wang, Pengcheng Guo, Ziqian Ning, Dake Guo, Kong Aik Lee, Eng-Siong Chng, and Lei Xie. NPU-NTU system for Voice Privacy 2024 Challenge. arXiv preprint arXiv:2409.04173, 2024.
- [63] Xi Yin and Xiaoming Liu. Multi-task convolutional neural network for pose-invariant face recognition. IEEE Transactions on Image Processing, 27(2):964–975, 2017.
- [64] Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. CoRR, abs/1506.03365, 2015. URL http://arxiv.org/abs/1506.03365.
- [65] Zheyu Zhang and Seong-Yoon Shin. Two-dimensional human pose estimation with deep learning: A review. Applied Sciences, 15(13):7344, 2025.
- [66] Qing Zhu, Qirong Mao, Wenlong Dong, Xiuyan Shao, Xiaohua Huang, and Wenming Zheng. Adaptive key role guided hierarchical relation inference for enhanced group-level emotion recognition. IEEE Transactions on Affective Computing, 2025.
- [67] M. Sami Zitouni, Peter Lee, Uichin Lee, Leontios J. Hadjileontiadis, and Ahsan Khandoker. Privacy aware affective state recognition from visual data. IEEE Access, 10:40620–40628, 2022.
- [68] Daoming Zong, Chaoyue Ding, Baoxiang Li, Dinghao Zhou, Jiakui Li, Ken Zheng, and Qunyan Zhou. Building robust multimodal sentiment recognition via a simple..., 2023.