pith. machine review for the scientific record.

arxiv: 2604.02397 · v1 · submitted 2026-04-02 · 💻 cs.CV · cs.AI


Variational Encoder-Multi-Decoder (VE-MD) for Privacy-by-functional-design (Group) Emotion Recognition

Anderson Augusma (UGA), Dominique Vaufreydaz (LIG, M-PSI), Frédérique Letué (SVH)


Pith reviewed 2026-05-13 21:49 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords group emotion recognition · privacy by functional design · variational encoder · multi-decoder · structural representations · collective affect · in-the-wild datasets · multimodal fusion

The pith

The VE-MD framework recognizes group emotions at state-of-the-art levels while outputting only aggregate affect, learned from a shared latent representation supervised with structural decoding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VE-MD, a variational encoder with multiple decoders that learns a shared latent representation jointly optimized for group emotion classification and internal prediction of body and facial structures. This setup avoids any per-person emotion labels or identity outputs, relying on functional design to limit the pipeline to collective affect inference in settings like crowds or classrooms. Structural decoding, via either a transformer PersonQuery or dense heatmap decoder, preserves interaction cues that latent optimization alone tends to attenuate in group tasks. Results show this yields up to 90.06% accuracy on GAF-3.0 and 82.25% on VGAF with audio, while the same structural outputs serve as a denoising step on individual emotion benchmarks.

Core claim

VE-MD constrains outputs to aggregate group-level affect by training a variational encoder whose latent space supports decoding of variable-size body and face structures. Explicit structural supervision maintains the interaction-related information needed for collective inference; latent-space optimization by itself attenuates those cues in group emotion recognition, though it provides denoising benefits in individual emotion recognition.
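One plausible way to write that training setup down is as a standard multi-task variational loss. The weights λ_s and β, the structural loss ℓ, and the single latent z are illustrative assumptions for this sketch, not the paper's exact formulation:

    \mathcal{L}(x, y_{\text{group}}, S) = \mathrm{CE}\big(f_{\text{emo}}(z),\, y_{\text{group}}\big) + \lambda_s\, \ell\big(f_{\text{struct}}(z),\, S\big) + \beta\, D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big), \qquad z \sim q_\phi(z \mid x)

On this reading, the GER-versus-IER contrast is the claim that setting λ_s = 0 (latent-only optimization) attenuates interaction cues and hurts group-level accuracy, while the structural term behaves mainly as a denoising regularizer in the individual case.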

What carries the argument

Variational Encoder-Multi-Decoder (VE-MD) architecture whose decoders (transformer PersonQuery or dense Heatmap) predict body and facial structural representations to regularize the shared latent space for group affect classification.
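To make the shape of that pipeline concrete, a minimal PyTorch-style sketch follows. Everything in it is an illustrative assumption: the backbone, the dimensions, the single latent space (the paper's Figure 1 describes two, Z1 and Z2), and a flat keypoint head standing in for the PersonQuery/Heatmap decoders.

    import torch
    import torch.nn as nn

    class VEMDSketch(nn.Module):
        """Minimal sketch of the VE-MD shape: one variational encoder, an
        emotion decoder (the only deployed output), and a structural decoder
        used purely as training-time supervision."""

        def __init__(self, latent_dim=256, num_emotions=3, num_keypoints=38):
            super().__init__()
            # Stand-in visual encoder; the paper's backbone is not reproduced here.
            self.backbone = nn.Sequential(
                nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
                nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
                nn.Flatten(),
            )
            self.mu = nn.Linear(64, latent_dim)       # variational mean
            self.logvar = nn.Linear(64, latent_dim)   # variational log-variance
            # Group-level affect head: the only output exposed at deployment.
            self.emotion_head = nn.Linear(latent_dim, num_emotions)
            # Structural head (body/face keypoints): internal supervision only,
            # standing in for the paper's PersonQuery/Heatmap decoders.
            self.struct_head = nn.Linear(latent_dim, num_keypoints * 2)

        def forward(self, frames):
            h = self.backbone(frames)
            mu, logvar = self.mu(h), self.logvar(h)
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
            return {
                "emotion_logits": self.emotion_head(z),  # aggregate affect
                "structure": self.struct_head(z),        # training-time target
                "mu": mu,
                "logvar": logvar,
            }

The privacy-by-functional-design property is visible in the interface: only emotion_logits is meant to leave the model at deployment, while struct_head exists solely to shape the latent space during training.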

If this is right

  • Structural supervision consistently raises performance on group emotion datasets by retaining interaction cues.
  • The same structural outputs improve individual emotion recognition by acting as a denoising bottleneck.
  • Multimodal fusion with audio reaches 82.25% on VGAF and outperforms prior work on SamSemo at 77.9%.
  • Group-level tasks require explicit structural preservation, unlike individual tasks where latent optimization suffices.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same encoder-decoder split could be applied to other collective behaviors such as group attention or crowd dynamics without adding per-person outputs.
  • Deployment in regulated environments would still need separate verification that latent features cannot be inverted to recover identities.
  • Replacing the structural decoders with purely latent constraints would likely reduce accuracy on group datasets while preserving the privacy property.
  • The observed GER versus IER performance gap suggests that collective affect modeling benefits from mechanisms that explicitly encode pairwise or spatial relations.

Load-bearing premise

Avoiding explicit per-person emotion or identity outputs is enough to deliver meaningful privacy protection when the model is deployed on real group data.

What would settle it

An adversary successfully reconstructing individual identities or per-person emotions from the latent representations or decoder outputs on a held-out test set would falsify the privacy claim.
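A minimal sketch of one way to run that falsification test, assuming access to frozen VE-MD latents paired with identity labels; the probe design, data interface, and split are illustrative, and the paper reports no such experiment:

    import torch
    import torch.nn as nn

    def identity_probe_accuracy(train_z, train_ids, test_z, test_ids, num_ids,
                                epochs=100, lr=1e-3):
        """Train a linear probe from frozen latents to identity labels and
        report held-out accuracy. Accuracy well above 1/num_ids chance would
        undermine the functional-privacy premise."""
        probe = nn.Linear(train_z.shape[1], num_ids)
        opt = torch.optim.Adam(probe.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):
            opt.zero_grad()
            loss_fn(probe(train_z), train_ids).backward()
            opt.step()
        with torch.no_grad():
            acc = (probe(test_z).argmax(dim=1) == test_ids).float().mean().item()
        return acc  # compare against the 1 / num_ids chance rate

A linear probe is the weakest adversary, so a negative result would not settle the question; but a positive one, with identities recovered well above chance, would falsify the premise, which is the asymmetry this test relies on.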

Figures

Figures reproduced from arXiv: 2604.02397 by Anderson Augusma (UGA), Dominique Vaufreydaz (LIG, M-PSI), and Frédérique Letué (SVH).

Figure 1
Figure 1: Overview of the proposed VE-MD architecture. Left: input data, consisting of video frames or a single image. Middle: the Variational Encoder (VE, green box), which learns two latent spaces: Z1 for emotion recognition and Z2 for joint optimization of emotion recognition and person structural representation. Right: the multi-decoder head, where the Emotion Decoder uses Z1 and Z2, optionally complemented by s… view at source ↗
Figure 2
Figure 2: PersonQuery decoder. Left: the latent space is processed by an auxiliary convolutional module that produces three multi-scale feature tensors, F1, F2, and F3, which are flattened and fed to the transformer encoder. Right: the transformer decoder receives target queries together with the encoded features and predicts the structural representation and the adjacency matrix through a multi-layer perceptron h… view at source ↗
Figure 3
Figure 3: Heatmap decoder. At the top is the VE latent-space input, followed by a custom UNet-upsample network, and the limbs decoder is applied to predict the limb heatmap. The same process is applied to the structural representation of faces. … view at source ↗
Figure 4
Figure 4: Overview of datasets used in this study. GAF-3.0 and VGAF at left illustrate Group Emotion Recognition (GER) datasets, followed by examples from IER datasets: DFEW, SAMSEMO, MER-MULTI, and EngageNet. view at source ↗
Figure 5
Figure 5: Example of automatic data annotation for structural representation. The first line corresponds to body annotations using ViTPose with 18 limb connections (COCO-style). The second line corresponds to face annotations using FaceAlignment with 20 custom limb connections. … view at source ↗
Figure 6
Figure 6: Summary of VE-MD versus SOTA across datasets. The first two columns correspond to GER datasets. Asterisks (*) indicate a statistically significant difference versus the SOTA using a two-sided McNemar test (p < 0.05). … view at source ↗
Figure 7
Figure 7: Qualitative examples of predicted structural representations (SR) obtained with VE-MD-PersonQuery on MER-MULTI using a query value of 6. The model predicts more structures than those aligned with the ground truth, illustrating how noisy or redundant SR predictions can arise. … view at source ↗
read the original abstract

Group Emotion Recognition (GER) aims to infer collective affect in social environments such as classrooms, crowds, and public events. Many existing approaches rely on explicit individual-level processing, including cropped faces, person tracking, or per-person feature extraction, which makes the analysis pipeline person-centric and raises privacy concerns in deployment scenarios where only group-level understanding is needed. This research proposes VE-MD, a Variational Encoder-Multi-Decoder framework for group emotion recognition under a privacy-aware functional design. Rather than providing formal anonymization or cryptographic privacy guarantees, VE-MD is designed to avoid explicit individual monitoring by constraining the model to predict only aggregate group-level affect, without identity recognition or per-person emotion outputs. VE-MD learns a shared latent representation jointly optimized for emotion classification and internal prediction of body and facial structural representations. Two structural decoding strategies are investigated: a transformer-based PersonQuery decoder and a dense Heatmap decoder that naturally accommodates variable group sizes. Experiments on six in-the-wild datasets, including two GER and four Individual Emotion Recognition (IER) benchmarks, show that structural supervision consistently improves representation learning. More importantly, the results reveal a clear distinction between GER and IER: optimizing the latent space alone is often insufficient for GER because it tends to attenuate interaction-related cues, whereas preserving explicit structural outputs improves collective affect inference. In contrast, projected structural representations seem to act as an effective denoising bottleneck for IER. VE-MD achieves state-of-the-art performance on GAF-3.0 (up to 90.06%) and VGAF (82.25% with multimodal fusion with audio). These results show that preserving interaction-related structural information is particularly beneficial for group-level affect modeling without relying on prior individual feature extraction. On IER datasets using multimodal fusion with audio modality, VE-MD outperforms SOTA on SamSemo (77.9%, adding text modality) while achieving competitive performance on MER-MULTI (63.8%), DFEW (70.7%) and EngageNet (69.0%).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes VE-MD, a Variational Encoder-Multi-Decoder framework for group emotion recognition (GER) under a privacy-by-functional-design approach. It learns a shared latent representation jointly optimized for group-level affect classification and structural decoding of faces/bodies (via PersonQuery transformer or Heatmap decoder), explicitly avoiding per-person emotion or identity outputs. Experiments across six in-the-wild datasets report SOTA results on GER benchmarks (90.06% on GAF-3.0; 82.25% on VGAF with audio fusion) and competitive performance on individual emotion recognition (IER) tasks, arguing that explicit structural outputs preserve interaction cues beneficial for GER but act as a denoising bottleneck for IER.

Significance. If the empirical results and the functional privacy premise hold under scrutiny, the work would provide a practical route to privacy-aware GER in sensitive settings (classrooms, crowds) without relying on explicit individual processing or cryptographic primitives. The reported distinction between GER and IER regarding the utility of structural supervision is a potentially useful insight for representation learning in affective computing.

major comments (2)
  1. [Abstract] The privacy-by-functional-design claim is load-bearing for the paper's contribution, yet it rests on the unverified premise that group-level outputs plus structural decoders (PersonQuery/Heatmap) cannot leak individual identity or per-person affect; no reconstruction metrics, membership-inference results, or inversion attacks on the latent space are reported, despite the explicit disclaimer of formal guarantees.
  2. [Abstract / Experimental section] SOTA numbers (90.06% on GAF-3.0, 82.25% on VGAF) are stated without reference to the experimental protocol, baseline implementations, ablation controls, or statistical significance tests, preventing verification that the gains are attributable to the claimed VE-MD mechanisms rather than to dataset-specific factors or multimodal fusion (a minimal significance-test sketch follows the comment lists below).
minor comments (1)
  1. The motivation for why structural supervision attenuates interaction cues in GER but denoises IER could be strengthened with a brief comparison to prior interaction-modeling literature.
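The significance test at issue in major comment 2 is the one the Figure 6 caption names: a two-sided McNemar test at p < 0.05 on paired per-item predictions. A minimal sketch using statsmodels follows; the evaluation harness around it is an assumption, not the paper's code.

    from statsmodels.stats.contingency_tables import mcnemar

    def mcnemar_pvalue(preds_a, preds_b, labels):
        """Two-sided McNemar test on paired classifier decisions.
        Only the discordant counts matter: items one model gets right
        while the other gets them wrong."""
        b = sum(pa == y and pb != y for pa, pb, y in zip(preds_a, preds_b, labels))
        c = sum(pa != y and pb == y for pa, pb, y in zip(preds_a, preds_b, labels))
        result = mcnemar([[0, b], [c, 0]], exact=True)  # exact binomial test
        return result.pvalue  # p < 0.05 marks a significant difference, as in Figure 6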

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below with clarifications and proposed revisions to improve the manuscript.

read point-by-point responses
  1. Referee: [Abstract] The privacy-by-functional-design claim is load-bearing for the paper's contribution, yet it rests on the unverified premise that group-level outputs plus structural decoders (PersonQuery/Heatmap) cannot leak individual identity or per-person affect; no reconstruction metrics, membership-inference results, or inversion attacks on the latent space are reported, despite the explicit disclaimer of formal guarantees.

    Authors: We agree that the functional privacy claim requires careful framing. The manuscript already states that VE-MD provides no formal anonymization or cryptographic guarantees and is instead designed to avoid explicit individual monitoring by producing only group-level outputs. We will revise the abstract and add a dedicated paragraph in the introduction to explicitly distinguish functional design from formal privacy methods. We will also expand the limitations section to acknowledge the absence of reconstruction, membership-inference, or inversion attack evaluations. This constitutes a partial revision focused on clarification rather than new empirical privacy experiments. revision: partial

  2. Referee: [Abstract / Experimental section] SOTA numbers (90.06% on GAF-3.0, 82.25% on VGAF) are stated without reference to the experimental protocol, baseline implementations, ablation controls, or statistical significance tests, preventing verification that the gains are attributable to the claimed VE-MD mechanisms rather than to dataset-specific factors or multimodal fusion.

    Authors: The experimental section (Section 4) already details the evaluation protocols, baseline re-implementations, ablation studies on the PersonQuery and Heatmap decoders, and reports results with standard deviations across multiple runs. To address the concern, we will revise the abstract to include concise references to the key experimental settings and cross-reference the relevant subsections. We will also ensure all performance claims are accompanied by explicit baseline comparisons in the main text. This is a straightforward revision to improve readability and verifiability. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical results on external benchmarks with no derivations or self-referential fitting

full rationale

The manuscript contains no equations, derivations, or parameter-fitting steps that could reduce to self-definition. Performance claims (e.g., 90.06% on GAF-3.0) rest on direct evaluation against independent public datasets. The privacy-by-functional-design argument is an architectural choice (group-level outputs only, no per-person labels) rather than a derived result; it is not obtained by fitting or by self-citation chains. No load-bearing self-citations, uniqueness theorems, or ansatzes appear in the provided text. This is the normal case of a purely empirical architecture paper whose central claims are externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond standard variational auto-encoder assumptions; the central claim rests on empirical performance rather than new theoretical constructs.

axioms (1)
  • standard math: Standard variational inference and reconstruction objectives for encoder-decoder networks
    Implicit reliance on VAE training assumptions common in the literature.

pith-pipeline@v0.9.0 · 5712 in / 1182 out tokens · 37871 ms · 2026-05-13T21:49:43.971158+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 1 internal anchor

  1. [1]

    Engagement measurement based on facial landmarks and spatial-temporal graph convolutional networks

    Ali Abedi and Shehroz S Khan. Engagement measurement based on facial landmarks and spatial-temporal graph convolutional networks. arXiv e-prints, pages arXiv–2403, 2024

  2. [2]

    How the GDPR will change the world

    Jan Philipp Albrecht. How the GDPR will change the world. Eur. Data Prot. L. Rev., 2:287, 2016

  3. [3]

    Exceda: Unlocking attention paradigms in extended duration e-classrooms by leveraging attention-mechanism models

    Avinash Anand, Avni Mittal, Laavanaya Dhawan, Juhi Krishnamurthy, Mahisha Ramesh, Naman Lal, Astha Verma, Pijush Bhuyan, Rajiv Ratn Shah, Roger Zimmermann, et al. Exceda: Unlocking attention paradigms in extended duration e-classrooms by leveraging attention-mechanism models. In 2024 IEEE 7th International Conference on Multimedia Information Processi...

  4. [4]

    Multimodal Group Emotion Recognition In-the-wild Using Privacy-Compliant Features

    Anderson Augusma, Dominique Vaufreydaz, and Frédérique Letué. Multimodal Group Emotion Recognition In-the-wild Using Privacy-Compliant Features. In Proceedings of the 25th International Conference on Multimodal Interaction, pages 750–754, 2023

  5. [5]

    wav2vec 2.0: A framework for self-supervised learning of speech representations

    Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460, 2020

  6. [6]

    Samsemo: New dataset for multilingual and multimodal emotion recognition

    Paweł Bujnowski, Bartłomiej Kuźma, Bartłomiej Paziewski, Jacek Rutkowski, Joanna Marhula, Zuzanna Bordzicka, and Piotr Andruszkiewicz. Samsemo: New dataset for multilingual and multimodal emotion recognition. In Interspeech, 2024

  7. [7]

    How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks)

    Adrian Bulat and Georgios Tzimiropoulos. How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In Proceedings of the IEEE International Conference on Computer Vision, pages 1021–1030, 2017

  8. [8]

    Human observers and automated assessment of dynamic emotional facial expressions: Kdef-dyn database validation

    Manuel G Calvo, Andrés Fernández-Martín, Guillermo Recio, and Daniel Lundqvist. Human observers and automated assessment of dynamic emotional facial expressions: Kdef-dyn database validation. Frontiers in Psychology, 9:2052, 2018

  9. [9]

    The EU's AI Act: A framework for collaborative governance

    Celso Cancela-Outeda. The EU's AI Act: A framework for collaborative governance. Internet of Things, 27:101291, 2024

  10. [10]

    Realtime multi-person 2d pose estimation using part affinity fields

    Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7291–7299, 2017

  11. [11]

    Openpose: Realtime multi-person 2d pose estimation using part affinity fields

    Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Openpose: Realtime multi-person 2d pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(1):172–186, 2019

  12. [12]

    End-to-end object detection with transformers

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020

  13. [13]

    Règlement sur l'intelligence artificielle - version enrichie, 2024

    Bertrand Cassar. Règlement sur l'intelligence artificielle - version enrichie, 2024

  14. [14]

    Finecliper: Multi-modal fine-grained clip for dynamic facial expression recognition with adapters

    Haodong Chen, Haojian Huang, Junhao Dong, Mingzhe Zheng, and Dian Shao. Finecliper: Multi-modal fine-grained clip for dynamic facial expression recognition with adapters. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 2301–2310, 2024

  15. [15]

    System description for voice privacy challenge 2022

    Xiaojiao Chen, Guangxing Li, Hao Huang, Wangjin Zhou, Sheng Li, Yang Cao, and Yi Zhao. System description for voice privacy challenge 2022. In Proc. 2nd Symposium on Security and Privacy in Speech Communication, 2022

  16. [16]

    Emotiw 2020: Driver gaze, group emotion, student engagement and physiological signal based challenges

    Abhinav Dhall, Garima Sharma, Roland Goecke, and Tom Gedeon. Emotiw 2020: Driver gaze, group emotion, student engagement and physiological signal based challenges. In Proceedings of the 2020 International Conference on Multimodal Interaction, pages 784–789, 2020

  17. [17]

    Emotiw 2023: Emotion recognition in the wild challenge

    Abhinav Dhall, Monisha Singh, Roland Goecke, Tom Gedeon, Donghuo Zeng, Yanan Wang, and Kazushi Ikeda. Emotiw 2023: Emotion recognition in the wild challenge. In Proceedings of the 25th International Conference on Multimodal Interaction (ICMI 2023), 2023

  18. [18]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021

  19. [19]

    Training generative neural networks via maximum mean discrepancy optimization

    Gintare Karolina Dziugaite, Daniel M Roy, and Zoubin Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pages 258–267, 2015

  20. [20]

    Faces-a database of facial expressions in young, middle-aged, and older women and men: Development and validation

    Natalie C. Ebner, Michaela Riediger, and Ulman Lindenberger. Faces-a database of facial expressions in young, middle-aged, and older women and men: Development and validation. Behavior Research Methods, 42:351–362, 2010. ISSN 1554351X. doi:10.3758/BRM.42.1.351

  21. [21]

    Multi-task learning on the edge for effective gender, age, ethnicity and emotion recognition

    Pasquale Foggia, Antonio Greco, Alessia Saggese, and Mario Vento. Multi-task learning on the edge for effective gender, age, ethnicity and emotion recognition. Engineering Applications of Artificial Intelligence, 118:105651, 2023

  22. [22]

    Emoclip: A vision-language method for zero-shot video facial expression recognition

    Niki Maria Foteinopoulou and Ioannis Patras. Emoclip: A vision-language method for zero-shot video facial expression recognition. In 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG), pages 1–10. IEEE, 2024

  23. [23]

    Graph neural networks for image understanding based on multiple cues: Group emotion recognition and event recognition as use cases

    Xin Guo, Luisa Polania, Bin Zhu, Charles Boncelet, and Kenneth Barner. Graph neural networks for image understanding based on multiple cues: Group emotion recognition and event recognition as use cases. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2921–2930, 2020

  24. [24]

    An attention model for group-level emotion recognition

    Aarush Gupta, Dakshit Agrawal, Hardik Chauhan, Jose Dolz, and Marco Pedersoli. An attention model for group-level emotion recognition. In Proceedings of the 20th ACM International Conference on Multimodal Interaction, pages 611–615, 2018

  25. [25]

    Multimodal face-pose estimation with multitask manifold deep learning

    Chaoqun Hong, Jun Yu, Jian Zhang, Xiongnan Jin, and Kyong-Ho Lee. Multimodal face-pose estimation with multitask manifold deep learning. IEEE Transactions on Industrial Informatics, 15(7):3952–3961, 2018

  26. [26]

    Deep multi-task learning to recognise subtle facial expressions of mental states

    Guosheng Hu, Li Liu, Yang Yuan, Zehao Yu, Yang Hua, Zhihong Zhang, Fumin Shen, Ling Shao, Timothy Hospedales, Neil Robertson, et al. Deep multi-task learning to recognise subtle facial expressions of mental states. In Proceedings of the European Conference on Computer Vision (ECCV), pages 103–119, 2018

  27. [27]

    Psmf: Prototype network subgraph with multi-head attention framework for group emotion recognition

    Wenti Huang, Jun Long, et al. Psmf: Prototype network subgraph with multi-head attention framework for group emotion recognition. Not published yet (review), page 121969, 2025

  28. [28]

    Dfew: A large-scale database for recognizing dynamic facial expressions in the wild

    Xingxun Jiang, Yuan Zong, Wenming Zheng, Chuangao Tang, Wanchuang Xia, Cheng Lu, and Jiateng Liu. Dfew: A large-scale database for recognizing dynamic facial expressions in the wild. In Proceedings of the 28th ACM International Conference on Multimedia, pages 2881–2889, 2020

  29. [29]

    Comparative analysis of openpose, posenet, and movenet models for pose estimation in mobile devices

    BeomJun Jo and SeongKi Kim. Comparative analysis of openpose, posenet, and movenet models for pose estimation in mobile devices. Traitement du Signal, 39(1):119, 2022

  30. [30]

    Joint fine-tuning in deep neural networks for facial expression recognition

    Heechul Jung, Sihaeng Lee, Junho Yim, Sunjeong Park, and Junmo Kim. Joint fine-tuning in deep neural networks for facial expression recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 2983–2991, 2015

  31. [31]

    Object detection in real time based on improved single shot multi-box detector algorithm

    Ashwani Kumar, Zuopeng Justin Zhang, and Hongbo Lyu. Object detection in real time based on improved single shot multi-box detector algorithm. EURASIP Journal on Wireless Communications and Networking, 2020(1):204, 2020

  32. [32]

    Fusing multimodal streams for improved group emotion recognition in videos

    Deepak Kumar, Piyush Dhamdhere, and Balasubramanian Raman. Fusing multimodal streams for improved group emotion recognition in videos. In International Conference on Pattern Recognition, pages 403–418. Springer, 2025

  33. [33]

    Exploring vq-vae with prosody parameters for speaker anonymization

    Sotheara Leang, Anderson Augusma, Eric Castelli, Frédérique Letué, Sethserey Sam, and Dominique Vaufreydaz. Exploring vq-vae with prosody parameters for speaker anonymization. arXiv preprint arXiv:2409.15882, 2024

  34. [34]

    Mer 2023: Multi-label learning, modality robustness, and semi-supervised learning

    Zheng Lian, Haiyang Sun, Licai Sun, Kang Chen, Mingyu Xu, Kexin Wang, Ke Xu, Yu He, Ying Li, Jinming Zhao, et al. Mer 2023: Multi-label learning, modality robustness, and semi-supervised learning. In Proceedings of the 31st ACM International Conference on Multimedia, pages 9610–9614, 2023

  35. [35]

    Group level audio-video emotion recognition using hybrid networks

    Chuanhe Liu, Wenqiang Jiang, Minghao Wang, and Tianhao Tang. Group level audio-video emotion recognition using hybrid networks. In Proceedings of the 2020 International Conference on Multimodal Interaction (ICMI 2020), pages 807–812. Association for Computing Machinery, 2020. ISBN 9781450375818. doi:10.1145/3382507.3417968

  36. [36]

    All rivers run into the sea: Unified modality brain-inspired emotional central mechanism

    Xinji Mai, Junxiong Lin, Haoran Wang, Zeng Tao, Yan Wang, Shaoqi Yan, Xuan Tong, Jiawen Yu, Boyang Wang, Ziheng Zhou, et al. All rivers run into the sea: Unified modality brain-inspired emotional central mechanism. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 632–641, 2024

  37. [37]

    Ous: Scene-guided dynamic facial expression recognition

    Xinji Mai, Haoran Wang, Zeng Tao, Junxiong Lin, Shaoqi Yan, Yan Wang, Jing Liu, Jiawen Yu, Xuan Tong, Yating Li, et al. Ous: Scene-guided dynamic facial expression recognition. CoRR, 2024

  38. [38]

    Facial landmark-based emotion recognition via directed graph neural network

    Quang Tran Ngoc, Seunghyun Lee, and Byung Cheol Song. Facial landmark-based emotion recognition via directed graph neural network. Electronics, 9(5):764, 2020

  39. [39]

    Attentive statistics pooling for deep speaker embedding

    Koji Okabe, Takafumi Koshinaka, and Koichi Shinoda. Attentive statistics pooling for deep speaker embedding. arXiv preprint arXiv:1803.10963, 2018

  40. [40]

    Real-time 2d multi-person pose estimation on cpu: Lightweight openpose

    D Osokin. Real-time 2d multi-person pose estimation on cpu: Lightweight openpose. In ICPRAM 2019 - Proceedings of the 8th International Conference on Pattern Recognition Applications and Methods, pages 744–748, 2019

  41. [41]

    Privacy-preserving video classification with convolutional neural networks

    Sikha Pentyala, Rafael Dowsley, and Martine De Cock. Privacy-preserving video classification with convolutional neural networks. In International Conference on Machine Learning, pages 8487–8499. PMLR, 2021

  42. [42]

    Multitask, multilabel, and multidomain learning with convolutional networks for emotion recognition

    Gerard Pons and David Masip. Multitask, multilabel, and multidomain learning with convolutional networks for emotion recognition. IEEE Transactions on Cybernetics, 52(6):4764–4771, 2020

  43. [43]

    Robust speech recognition via large-scale weak supervision

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, pages 28492–28518. PMLR, 2023

  44. [44]

    Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition

    Rajeev Ranjan, Vishal M Patel, and Rama Chellappa. Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. IEEE transactions on pattern analysis and machine intelligence, 41(1):121–135, 2017

  45. [45]

    U-net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015

  46. [46]

    Neural network model for video-based analysis of student's emotions in e-learning

    Andrey V Savchenko and IA Makarov. Neural network model for video-based analysis of student's emotions in e-learning. Optical Memory and Neural Networks, 31(3):237–244, 2022

  47. [47]

    Audio-visual automatic group affect analysis

    Garima Sharma, Abhinav Dhall, and Jianfei Cai. Audio-visual automatic group affect analysis. IEEE Transactions on Affective Computing, 2021

  48. [48]

    End-to-end multi-person pose estimation with transformers

    Dahu Shi, Xing Wei, Liangqi Li, Ye Ren, and Wenming Tan. End-to-end multi-person pose estimation with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11069–11078, 2022

  49. [49]

    Do i have your attention: A large scale engagement prediction dataset and baselines

    Monisha Singh, Ximi Hoque, Donghuo Zeng, Yanan Wang, Kazushi Ikeda, and Abhinav Dhall. Do i have your attention: A large scale engagement prediction dataset and baselines. In Proceedings of the 25th International Conference on Multimodal Interaction , pages 174–182, 2023

  50. [50]

    Multi-modal fusion using spatio-temporal and static features for group emotion recognition

    Mo Sun, Jian Li, Hui Feng, Wei Gou, Haifeng Shen, Jian Tang, Yi Yang, and Jieping Ye. Multi-modal fusion using spatio-temporal and static features for group emotion recognition. In Proceedings of the 2020 International Conference on Multimodal Interaction (ICMI 2020), pages 835–840, 2020. doi:10.1145/3382507.3417971

  51. [51]

    Group emotion recognition with individual facial emotion cnns and global image based cnns

    Lianzhi Tan, Kaipeng Zhang, Kai Wang, Xiaoxing Zeng, Xiaojiang Peng, and Yu Qiao. Group emotion recognition with individual facial emotion cnns and global image based cnns. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, pages 549–552, 2017

  52. [52]

    Align-dfer: Pioneering comprehensive dynamic affective alignment for dynamic facial expression recognition with clip

    Zeng Tao, Yan Wang, Junxiong Lin, Haoran Wang, Xinji Mai, Jiawen Yu, Xuan Tong, Ziheng Zhou, Shaoqi Yan, Qing Zhao, et al. Align-dfer: Pioneering comprehensive dynamic affective alignment for dynamic facial expression recognition with clip. CoRR, 2024

  53. [53]

    Tcct-net: Two-stream network architecture for fast and efficient engagement estimation via behavioral feature signals

    Alexander Vedernikov, Puneet Kumar, Haoyu Chen, Tapio Seppänen, and Xiaobai Li. Tcct-net: Two-stream network architecture for fast and efficient engagement estimation via behavioral feature signals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 4723–4732, June 2024

  54. [54]

    Social signal processing: Survey of an emerging domain

    Alessandro Vinciarelli, Maja Pantic, and Hervé Bourlard. Social signal processing: Survey of an emerging domain. Image and vision computing, 27(12):1743–1759, 2009

  55. [55]

    Hierarchical audio-visual information fusion with multi-label joint decoding for mer 2023

    Haotian Wang, Yuxuan Xi, Hang Chen, Jun Du, Yan Song, Qing Wang, Hengshun Zhou, Chenxi Wang, Jiefeng Ma, Pengfei Hu, et al. Hierarchical audio-visual information fusion with multi-label joint decoding for mer 2023. In Proceedings of the 31st ACM International Conference on Multimedia, pages 9531–9535, 2023

  56. [56]

    Improving multi-modal emotion recognition using entropy-based fusion and pruning-based network architecture optimization

    Haotian Wang, Jun Du, Yusheng Dai, Chin-Hui Lee, Yuling Ren, and Yu Liu. Improving multi-modal emotion recognition using entropy-based fusion and pruning-based network architecture optimization. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 11766–11770. IEEE, 2024

  57. [57]

    Cascade attention networks for group emotion recognition with face, body and image cues

    Kai Wang, Xiaoxing Zeng, Jianfei Yang, Debin Meng, Kaipeng Zhang, Xiaojiang Peng, and Yu Qiao. Cascade attention networks for group emotion recognition with face, body and image cues. In Proceedings of the 20th ACM International Conference on Multimodal Interaction, pages 640–645, 2018

  58. [58]

    A joint local spatial and global temporal cnn-transformer for dynamic facial expression recognition

    Linhuang Wang, Xin Kang, Fei Ding, Satoshi Nakagawa, and Fuji Ren. A joint local spatial and global temporal cnn-transformer for dynamic facial expression recognition. Applied Soft Computing, 161:111680, 2024

  59. [59]

    Skeleton-based st-gcn for human action recognition with extended skeleton graph and partitioning strategy

    Quanyu Wang, Kaixiang Zhang, and Manjotho Ali Asghar. Skeleton-based st-gcn for human action recognition with extended skeleton graph and partitioning strategy. IEEE Access, 10:41403–41410, 2022

  60. [60]

    Vitpose: Simple vision transformer baselines for human pose estimation

    Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao. Vitpose: Simple vision transformer baselines for human pose estimation. Advances in Neural Information Processing Systems, 35:38571–38584, 2022

  61. [61]

    Multi-clue fusion for emotion recognition in the wild

    Jingwei Yan, Wenming Zheng, Zhen Cui, Chuangao Tang, Tong Zhang, Yuan Zong, and Ning Sun. Multi-clue fusion for emotion recognition in the wild. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, pages 458–463, 2016

  62. [62]

    Npu-ntu system for voice privacy 2024 challenge

    Jixun Yao, Nikita Kuzmin, Qing Wang, Pengcheng Guo, Ziqian Ning, Dake Guo, Kong Aik Lee, Eng-Siong Chng, and Lei Xie. Npu-ntu system for voice privacy 2024 challenge. arXiv preprint arXiv:2409.04173, 2024

  63. [63]

    Multi-task convolutional neural network for pose-invariant face recognition

    Xi Yin and Xiaoming Liu. Multi-task convolutional neural network for pose-invariant face recognition. IEEE Transactions on Image Processing, 27(2):964–975, 2017

  64. [64]

    LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop

    Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. CoRR, abs/1506.03365, 2015. URL http://arxiv.org/abs/1506.03365

  65. [65]

    Two-dimensional human pose estimation with deep learning: A review

    Zheyu Zhang and Seong-Yoon Shin. Two-dimensional human pose estimation with deep learning: A review. Applied Sciences, 15(13):7344, 2025

  66. [66]

    Adaptive key role guided hierarchical relation inference for enhanced group-level emotion recognition

    Qing Zhu, Qirong Mao, Wenlong Dong, Xiuyan Shao, Xiaohua Huang, and Wenming Zheng. Adaptive key role guided hierarchical relation inference for enhanced group-level emotion recognition. IEEE Transactions on Affective Computing, 2025

  67. [67]

    Privacy aware affective state recognition from visual data

    M Sami Zitouni, Peter Lee, Uichin Lee, Leontios J Hadjileontiadis, and Ahsan Khandoker. Privacy aware affective state recognition from visual data. IEEE Access, 10:40620–40628, 2022

  68. [68]

    Building robust multimodal sentiment recognition via a simple yet effective multimodal transformer

    Daoming Zong, Chaoyue Ding, Baoxiang Li, Dinghao Zhou, Jiakui Li, Ken Zheng, and Qunyan Zhou. Building robust multimodal sentiment recognition via a simple yet effective multimodal transformer. In Proceedings of the 31st ACM International Conference on Multimedia, pages 9596–9600, 2023