pith. machine review for the scientific record.

arxiv: 2604.02397 · v1 · submitted 2026-04-02 · 💻 cs.CV · cs.AI


Variational Encoder-Multi-Decoder (VE-MD) for Privacy-by-functional-design (Group) Emotion Recognition

Anderson Augusma (UGA), Dominique Vaufreydaz (LIG, M-PSI), Frédérique Letué (SVH)


Pith reviewed 2026-05-13 21:49 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords group emotion recognition · privacy by functional design · variational encoder · multi-decoder · structural representations · collective affect · in-the-wild datasets · multimodal fusion

The pith

The VE-MD framework recognizes group emotions at state-of-the-art levels while outputting only aggregate affect, learned from a shared latent representation supervised with structural decoding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VE-MD, a variational encoder with multiple decoders that learns a shared latent representation jointly optimized for group emotion classification and internal prediction of body and facial structures. This setup avoids any per-person emotion labels or identity outputs, relying on functional design to limit the pipeline to collective affect inference in settings like crowds or classrooms. Structural decoding, via either a transformer PersonQuery or dense heatmap decoder, preserves interaction cues that latent optimization alone tends to attenuate in group tasks. Results show this yields up to 90.06% accuracy on GAF-3.0 and 82.25% on VGAF with audio, while the same structural outputs serve as a denoising step on individual emotion benchmarks.

Core claim

VE-MD constrains outputs to aggregate group-level affect by training a variational encoder whose latent space supports decoding of variable-size body and face structures. Explicit structural supervision maintains the interaction-related information needed for collective inference; latent-space optimization by itself attenuates those cues in group emotion recognition, though it provides denoising benefits in individual emotion recognition.
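One plausible way to write that training setup down is as a standard multi-task variational loss. The weights λ_s and β, the structural loss ℓ, and the single latent z are illustrative assumptions for this sketch, not the paper's exact formulation:

    \mathcal{L}(x, y_{\text{group}}, S) = \mathrm{CE}\big(f_{\text{emo}}(z),\, y_{\text{group}}\big) + \lambda_s\, \ell\big(f_{\text{struct}}(z),\, S\big) + \beta\, D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big), \qquad z \sim q_\phi(z \mid x)

On this reading, the GER-versus-IER contrast is the claim that setting λ_s = 0 (latent-only optimization) attenuates interaction cues and hurts group-level accuracy, while the structural term behaves mainly as a denoising regularizer in the individual case.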

What carries the argument

Variational Encoder-Multi-Decoder (VE-MD) architecture whose decoders (transformer PersonQuery or dense Heatmap) predict body and facial structural representations to regularize the shared latent space for group affect classification.
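To make the shape of that pipeline concrete, a minimal PyTorch-style sketch follows. Everything in it is an illustrative assumption: the backbone, the dimensions, the single latent space (the paper's Figure 1 describes two, Z1 and Z2), and a flat keypoint head standing in for the PersonQuery/Heatmap decoders.

    import torch
    import torch.nn as nn

    class VEMDSketch(nn.Module):
        """Minimal sketch of the VE-MD shape: one variational encoder, an
        emotion decoder (the only deployed output), and a structural decoder
        used purely as training-time supervision."""

        def __init__(self, latent_dim=256, num_emotions=3, num_keypoints=38):
            super().__init__()
            # Stand-in visual encoder; the paper's backbone is not reproduced here.
            self.backbone = nn.Sequential(
                nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
                nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
                nn.Flatten(),
            )
            self.mu = nn.Linear(64, latent_dim)       # variational mean
            self.logvar = nn.Linear(64, latent_dim)   # variational log-variance
            # Group-level affect head: the only output exposed at deployment.
            self.emotion_head = nn.Linear(latent_dim, num_emotions)
            # Structural head (body/face keypoints): internal supervision only,
            # standing in for the paper's PersonQuery/Heatmap decoders.
            self.struct_head = nn.Linear(latent_dim, num_keypoints * 2)

        def forward(self, frames):
            h = self.backbone(frames)
            mu, logvar = self.mu(h), self.logvar(h)
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
            return {
                "emotion_logits": self.emotion_head(z),  # aggregate affect
                "structure": self.struct_head(z),        # training-time target
                "mu": mu,
                "logvar": logvar,
            }

The privacy-by-functional-design property is visible in the interface: only emotion_logits is meant to leave the model at deployment, while struct_head exists solely to shape the latent space during training.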

If this is right

  • Structural supervision consistently raises performance on group emotion datasets by retaining interaction cues.
  • The same structural outputs improve individual emotion recognition by acting as a denoising bottleneck.
  • Multimodal fusion with audio reaches 82.25% on VGAF and outperforms prior work on SamSemo at 77.9%.
  • Group-level tasks require explicit structural preservation, unlike individual tasks where latent optimization suffices.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same encoder-decoder split could be applied to other collective behaviors such as group attention or crowd dynamics without adding per-person outputs.
  • Deployment in regulated environments would still need separate verification that latent features cannot be inverted to recover identities.
  • Replacing the structural decoders with purely latent constraints would likely reduce accuracy on group datasets while preserving the privacy property.
  • The observed GER versus IER performance gap suggests that collective affect modeling benefits from mechanisms that explicitly encode pairwise or spatial relations.

Load-bearing premise

Avoiding explicit per-person emotion or identity outputs is enough to deliver meaningful privacy protection when the model is deployed on real group data.

What would settle it

An adversary successfully reconstructing individual identities or per-person emotions from the latent representations or decoder outputs on a held-out test set would falsify the privacy claim.
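A minimal sketch of one way to run that falsification test, assuming access to frozen VE-MD latents paired with identity labels; the probe design, data interface, and split are illustrative, and the paper reports no such experiment:

    import torch
    import torch.nn as nn

    def identity_probe_accuracy(train_z, train_ids, test_z, test_ids, num_ids,
                                epochs=100, lr=1e-3):
        """Train a linear probe from frozen latents to identity labels and
        report held-out accuracy. Accuracy well above 1/num_ids chance would
        undermine the functional-privacy premise."""
        probe = nn.Linear(train_z.shape[1], num_ids)
        opt = torch.optim.Adam(probe.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):
            opt.zero_grad()
            loss_fn(probe(train_z), train_ids).backward()
            opt.step()
        with torch.no_grad():
            acc = (probe(test_z).argmax(dim=1) == test_ids).float().mean().item()
        return acc  # compare against the 1 / num_ids chance rate

A linear probe is the weakest adversary, so a negative result would not settle the question; but a positive one, with identities recovered well above chance, would falsify the premise, which is the asymmetry this test relies on.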

Figures

Figures reproduced from arXiv: 2604.02397 by Anderson Augusma (UGA), Dominique Vaufreydaz (LIG, M-PSI), and Frédérique Letué (SVH).

Figure 1
Figure 1: Overview of the proposed VE-MD architecture. Left: input data, consisting of video frames or a single image. Middle: the Variational Encoder (VE, green box), which learns two latent spaces: Z1 for emotion recognition and Z2 for joint optimization of emotion recognition and person structural representation. Right: the multi-decoder head, where the Emotion Decoder uses Z1 and Z2, optionally complemented by s… view at source ↗
Figure 2
Figure 2: PersonQuery decoder. Left: the latent space is processed by an auxiliary convolutional module that produces three multi-scale feature tensors, F1, F2, and F3, which are flattened and fed to the transformer encoder. Right: the transformer decoder receives target queries together with the encoded features and predicts the structural representation and the adjacency matrix through a multi-layer perceptron h… view at source ↗
Figure 3
Figure 3: Heatmap decoder. At the top is the VE latent-space input, followed by a custom UNet-upsample network, and the limbs decoder is applied to predict the limb heatmap. The same process is applied to the structural representation of faces. … view at source ↗
Figure 4
Figure 4: Overview of datasets used in this study. GAF-3.0 and VGAF at left illustrate Group Emotion Recognition (GER) datasets, followed by examples from IER datasets: DFEW, SAMSEMO, MER-MULTI, and EngageNet. view at source ↗
Figure 5
Figure 5: Example of automatic data annotation for structural representation. The first line corresponds to body annotations using ViTPose with 18 limb connections (COCO-style). The second line corresponds to face annotations using FaceAlignment with 20 custom limb connections. … view at source ↗
Figure 6
Figure 6: Summary of VE-MD versus SOTA across datasets. The first two columns correspond to GER datasets. Asterisks (*) indicate a statistically significant difference versus the SOTA using a two-sided McNemar test (p < 0.05). … view at source ↗
Figure 7
Figure 7: Qualitative examples of predicted structural representations (SR) obtained with VE-MD-PersonQuery on MER-MULTI using a query value of 6. The model predicts more structures than those aligned with the ground truth, illustrating how noisy or redundant SR predictions can arise. … view at source ↗
read the original abstract

Group Emotion Recognition (GER) aims to infer collective affect in social environments such as classrooms, crowds, and public events. Many existing approaches rely on explicit individual-level processing, including cropped faces, person tracking, or per-person feature extraction, which makes the analysis pipeline person-centric and raises privacy concerns in deployment scenarios where only group-level understanding is needed. This research proposes VE-MD, a Variational Encoder-Multi-Decoder framework for group emotion recognition under a privacy-aware functional design. Rather than providing formal anonymization or cryptographic privacy guarantees, VE-MD is designed to avoid explicit individual monitoring by constraining the model to predict only aggregate group-level affect, without identity recognition or per-person emotion outputs. VE-MD learns a shared latent representation jointly optimized for emotion classification and internal prediction of body and facial structural representations. Two structural decoding strategies are investigated: a transformer-based PersonQuery decoder and a dense Heatmap decoder that naturally accommodates variable group sizes. Experiments on six in-the-wild datasets, including two GER and four Individual Emotion Recognition (IER) benchmarks, show that structural supervision consistently improves representation learning. More importantly, the results reveal a clear distinction between GER and IER: optimizing the latent space alone is often insufficient for GER because it tends to attenuate interaction-related cues, whereas preserving explicit structural outputs improves collective affect inference. In contrast, projected structural representations seem to act as an effective denoising bottleneck for IER. VE-MD achieves state-of-the-art performance on GAF-3.0 (up to 90.06%) and VGAF (82.25% with multimodal fusion with audio). These results show that preserving interaction-related structural information is particularly beneficial for group-level affect modeling without relying on prior individual feature extraction. On IER datasets using multimodal fusion with audio modality, VE-MD outperforms SOTA on SamSemo (77.9%, adding text modality) while achieving competitive performance on MER-MULTI (63.8%), DFEW (70.7%) and EngageNet (69.0%).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes VE-MD, a Variational Encoder-Multi-Decoder framework for group emotion recognition (GER) under a privacy-by-functional-design approach. It learns a shared latent representation jointly optimized for group-level affect classification and structural decoding of faces/bodies (via PersonQuery transformer or Heatmap decoder), explicitly avoiding per-person emotion or identity outputs. Experiments across six in-the-wild datasets report SOTA results on GER benchmarks (90.06% on GAF-3.0; 82.25% on VGAF with audio fusion) and competitive performance on individual emotion recognition (IER) tasks, arguing that explicit structural outputs preserve interaction cues beneficial for GER but act as a denoising bottleneck for IER.

Significance. If the empirical results and the functional privacy premise hold under scrutiny, the work would provide a practical route to privacy-aware GER in sensitive settings (classrooms, crowds) without relying on explicit individual processing or cryptographic primitives. The reported distinction between GER and IER regarding the utility of structural supervision is a potentially useful insight for representation learning in affective computing.

major comments (2)
  1. [Abstract] The privacy-by-functional-design claim is load-bearing for the paper's contribution, yet it rests on the unverified premise that group-level outputs plus structural decoders (PersonQuery/Heatmap) cannot leak individual identity or per-person affect; no reconstruction metrics, membership-inference results, or inversion attacks on the latent space are reported, despite the explicit disclaimer of formal guarantees.
  2. [Abstract / Experimental section] SOTA numbers (90.06% on GAF-3.0, 82.25% on VGAF) are stated without reference to the experimental protocol, baseline implementations, ablation controls, or statistical significance tests, preventing verification that the gains are attributable to the claimed VE-MD mechanisms rather than to dataset-specific factors or multimodal fusion (a minimal significance-test sketch follows the comment lists below).
minor comments (1)
  1. The motivation for why structural supervision attenuates interaction cues in GER but denoises IER could be strengthened with a brief comparison to prior interaction-modeling literature.
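The significance test at issue in major comment 2 is the one the Figure 6 caption names: a two-sided McNemar test at p < 0.05 on paired per-item predictions. A minimal sketch using statsmodels follows; the evaluation harness around it is an assumption, not the paper's code.

    from statsmodels.stats.contingency_tables import mcnemar

    def mcnemar_pvalue(preds_a, preds_b, labels):
        """Two-sided McNemar test on paired classifier decisions.
        Only the discordant counts matter: items one model gets right
        while the other gets them wrong."""
        b = sum(pa == y and pb != y for pa, pb, y in zip(preds_a, preds_b, labels))
        c = sum(pa != y and pb == y for pa, pb, y in zip(preds_a, preds_b, labels))
        result = mcnemar([[0, b], [c, 0]], exact=True)  # exact binomial test
        return result.pvalue  # p < 0.05 marks a significant difference, as in Figure 6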

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below with clarifications and proposed revisions to improve the manuscript.

read point-by-point responses
  1. Referee: [Abstract] The privacy-by-functional-design claim is load-bearing for the paper's contribution, yet it rests on the unverified premise that group-level outputs plus structural decoders (PersonQuery/Heatmap) cannot leak individual identity or per-person affect; no reconstruction metrics, membership-inference results, or inversion attacks on the latent space are reported, despite the explicit disclaimer of formal guarantees.

    Authors: We agree that the functional privacy claim requires careful framing. The manuscript already states that VE-MD provides no formal anonymization or cryptographic guarantees and is instead designed to avoid explicit individual monitoring by producing only group-level outputs. We will revise the abstract and add a dedicated paragraph in the introduction to explicitly distinguish functional design from formal privacy methods. We will also expand the limitations section to acknowledge the absence of reconstruction, membership-inference, or inversion attack evaluations. This constitutes a partial revision focused on clarification rather than new empirical privacy experiments. revision: partial

  2. Referee: [Abstract / Experimental section] SOTA numbers (90.06% on GAF-3.0, 82.25% on VGAF) are stated without reference to the experimental protocol, baseline implementations, ablation controls, or statistical significance tests, preventing verification that the gains are attributable to the claimed VE-MD mechanisms rather than to dataset-specific factors or multimodal fusion.

    Authors: The experimental section (Section 4) already details the evaluation protocols, baseline re-implementations, ablation studies on the PersonQuery and Heatmap decoders, and reports results with standard deviations across multiple runs. To address the concern, we will revise the abstract to include concise references to the key experimental settings and cross-reference the relevant subsections. We will also ensure all performance claims are accompanied by explicit baseline comparisons in the main text. This is a straightforward revision to improve readability and verifiability. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical results on external benchmarks with no derivations or self-referential fitting

full rationale

The manuscript contains no equations, derivations, or parameter-fitting steps that could reduce to self-definition. Performance claims (e.g., 90.06% on GAF-3.0) rest on direct evaluation against independent public datasets. The privacy-by-functional-design argument is an architectural choice (group-level outputs only, no per-person labels) rather than a derived result; it is not obtained by fitting or by self-citation chains. No load-bearing self-citations, uniqueness theorems, or ansatzes appear in the provided text. This is the normal case of a purely empirical architecture paper whose central claims are externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond standard variational auto-encoder assumptions; the central claim rests on empirical performance rather than new theoretical constructs.

axioms (1)
  • standard math: Standard variational inference and reconstruction objectives for encoder-decoder networks
    Implicit reliance on VAE training assumptions common in the literature.

pith-pipeline@v0.9.0 · 5712 in / 1182 out tokens · 37871 ms · 2026-05-13T21:49:43.971158+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 1 internal anchor

  1. [1]

    Engagement measurement based on facial landmarks and spatial-temporal graph convolutional networks

    Ali Abedi and Shehroz S Khan. Engagement measurement based on facial landmarks and spatial-temporal graph convolutional networks. arXiv e-prints, pages arXiv–2403, 2024

  2. [2]

    How the GDPR will change the world

    Jan Philipp Albrecht. How the GDPR will change the world. Eur. Data Prot. L. Rev., 2:287, 2016

  3. [3]

    Exceda: Unlocking attention paradigms in extended duration e-classrooms by leveraging attention-mechanism models

    Avinash Anand, Avni Mittal, Laavanaya Dhawan, Juhi Krishnamurthy, Mahisha Ramesh, Naman Lal, Astha Verma, Pijush Bhuyan, Rajiv Ratn Shah, Roger Zimmermann, et al. Exceda: Unlocking attention paradigms in extended duration e-classrooms by leveraging attention-mechanism models. In 2024 IEEE 7th International Conference on Multimedia Information Processi...

  4. [4]

    Multimodal Group Emotion Recognition In-the-wild Using Privacy-Compliant Features

    Anderson Augusma, Dominique Vaufreydaz, and Frédérique Letué. Multimodal Group Emotion Recognition In-the-wild Using Privacy-Compliant Features. In Proceedings of the 25th International Conference on Multimodal Interaction, pages 750–754, 2023

  5. [5]

    wav2vec 2.0: A framework for self-supervised learning of speech representations

    Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460, 2020

  6. [6]

    Samsemo: New dataset for multilingual and multimodal emotion recognition

    Paweł Bujnowski, Bartłomiej Kuźma, Bartłomiej Paziewski, Jacek Rutkowski, Joanna Marhula, Zuzanna Bordzicka, and Piotr Andruszkiewicz. Samsemo: New dataset for multilingual and multimodal emotion recognition. In Interspeech, 2024

  7. [7]

    How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks)

    Adrian Bulat and Georgios Tzimiropoulos. How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In Proceedings of the IEEE International Conference on Computer Vision, pages 1021–1030, 2017

  8. [8]

    Human observers and automated assessment of dynamic emotional facial expressions: Kdef-dyn database validation

    Manuel G Calvo, Andrés Fernández-Martín, Guillermo Recio, and Daniel Lundqvist. Human observers and automated assessment of dynamic emotional facial expressions: Kdef-dyn database validation. Frontiers in Psychology, 9:2052, 2018

  9. [9]

    The EU's AI Act: A framework for collaborative governance

    Celso Cancela-Outeda. The EU's AI Act: A framework for collaborative governance. Internet of Things, 27:101291, 2024

  10. [10]

    Realtime multi-person 2d pose estimation using part affinity fields

    Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7291–7299, 2017

  11. [11]

    Openpose: Realtime multi-person 2d pose estimation using part affinity fields

    Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Openpose: Realtime multi-person 2d pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(1):172–186, 2019

  12. [12]

    End-to-end object detection with transformers

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020

  13. [13]

    Règlement sur l'intelligence artificielle - version enrichie, 2024

    Bertrand Cassar. Règlement sur l'intelligence artificielle - version enrichie, 2024

  14. [14]

    Finecliper: Multi-modal fine-grained clip for dynamic facial expression recognition with adapters

    Haodong Chen, Haojian Huang, Junhao Dong, Mingzhe Zheng, and Dian Shao. Finecliper: Multi-modal fine-grained clip for dynamic facial expression recognition with adapters. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 2301–2310, 2024

  15. [15]

    System description for voice privacy challenge 2022

    Xiaojiao Chen, Guangxing Li, Hao Huang, Wangjin Zhou, Sheng Li, Yang Cao, and Yi Zhao. System description for voice privacy challenge 2022. In Proc. 2nd Symposium on Security and Privacy in Speech Communication, 2022

  16. [16]

    Emotiw 2020: Driver gaze, group emotion, student engagement and physiological signal based challenges

    Abhinav Dhall, Garima Sharma, Roland Goecke, and Tom Gedeon. Emotiw 2020: Driver gaze, group emotion, student engagement and physiological signal based challenges. In Proceedings of the 2020 International Conference on Multimodal Interaction, pages 784–789, 2020

  17. [17]

    Emotiw 2023: Emotion recognition in the wild challenge

    Abhinav Dhall, Monisha Singh, Roland Goecke, Tom Gedeon, Donghuo Zeng, Yanan Wang, and Kazushi Ikeda. Emotiw 2023: Emotion recognition in the wild challenge. In Proceedings of the 25th International Conference on Multimodal Interaction (ICMI 2023), 2023

  18. [18]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021

  19. [19]

    Training generative neural networks via maximum mean discrepancy optimization

    Gintare Karolina Dziugaite, Daniel M Roy, and Zoubin Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pages 258–267, 2015

  20. [20]

    Faces-a database of facial expressions in young, middle-aged, and older women and men: Development and validation

    Natalie C. Ebner, Michaela Riediger, and Ulman Lindenberger. Faces-a database of facial expressions in young, middle-aged, and older women and men: Development and validation. Behavior Research Methods, 42:351–362, 2010. ISSN 1554351X. doi:10.3758/BRM.42.1.351

  21. [21]

    Multi-task learning on the edge for effective gender, age, ethnicity and emotion recognition

    Pasquale Foggia, Antonio Greco, Alessia Saggese, and Mario Vento. Multi-task learning on the edge for effective gender, age, ethnicity and emotion recognition. Engineering Applications of Artificial Intelligence, 118:105651, 2023

  22. [22]

    Emoclip: A vision-language method for zero-shot video facial expression recognition

    Niki Maria Foteinopoulou and Ioannis Patras. Emoclip: A vision-language method for zero-shot video facial expression recognition. In 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG), pages 1–10. IEEE, 2024

  23. [23]

    Graph neural networks for image understanding based on multiple cues: Group emotion recognition and event recognition as use cases

    Xin Guo, Luisa Polania, Bin Zhu, Charles Boncelet, and Kenneth Barner. Graph neural networks for image understanding based on multiple cues: Group emotion recognition and event recognition as use cases. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2921–2930, 2020

  24. [24]

    An attention model for group-level emotion recognition

    Aarush Gupta, Dakshit Agrawal, Hardik Chauhan, Jose Dolz, and Marco Pedersoli. An attention model for group-level emotion recognition. In Proceedings of the 20th ACM International Conference on Multimodal Interaction, pages 611–615, 2018

  25. [25]

    Multimodal face-pose estimation with multitask manifold deep learning

    Chaoqun Hong, Jun Yu, Jian Zhang, Xiongnan Jin, and Kyong-Ho Lee. Multimodal face-pose estimation with multitask manifold deep learning. IEEE Transactions on Industrial Informatics, 15(7):3952–3961, 2018

  26. [26]

    Deep multi-task learning to recognise subtle facial expressions of mental states

    Guosheng Hu, Li Liu, Yang Yuan, Zehao Yu, Yang Hua, Zhihong Zhang, Fumin Shen, Ling Shao, Timothy Hospedales, Neil Robertson, et al. Deep multi-task learning to recognise subtle facial expressions of mental states. In Proceedings of the European Conference on Computer Vision (ECCV), pages 103–119, 2018

  27. [27]

    Psmf: Prototype network subgraph with multi-head attention framework for group emotion recognition

    Wenti Huang, Jun Long, et al. Psmf: Prototype network subgraph with multi-head attention framework for group emotion recognition. Not published yet (review), page 121969, 2025

  28. [28]

    Dfew: A large-scale database for recognizing dynamic facial expressions in the wild

    Xingxun Jiang, Yuan Zong, Wenming Zheng, Chuangao Tang, Wanchuang Xia, Cheng Lu, and Jiateng Liu. Dfew: A large-scale database for recognizing dynamic facial expressions in the wild. In Proceedings of the 28th ACM International Conference on Multimedia, pages 2881–2889, 2020

  29. [29]

    Comparative analysis of openpose, posenet, and movenet models for pose estimation in mobile devices

    BeomJun Jo and SeongKi Kim. Comparative analysis of openpose, posenet, and movenet models for pose estimation in mobile devices. Traitement du Signal, 39(1):119, 2022

  30. [30]

    Joint fine-tuning in deep neural networks for facial expression recognition

    Heechul Jung, Sihaeng Lee, Junho Yim, Sunjeong Park, and Junmo Kim. Joint fine-tuning in deep neural networks for facial expression recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 2983–2991, 2015

  31. [31]

    Object detection in real time based on improved single shot multi-box detector algorithm

    Ashwani Kumar, Zuopeng Justin Zhang, and Hongbo Lyu. Object detection in real time based on improved single shot multi-box detector algorithm. EURASIP Journal on Wireless Communications and Networking, 2020(1):204, 2020

  32. [32]

    Fusing multimodal streams for improved group emotion recognition in videos

    Deepak Kumar, Piyush Dhamdhere, and Balasubramanian Raman. Fusing multimodal streams for improved group emotion recognition in videos. In International Conference on Pattern Recognition, pages 403–418. Springer, 2025

  33. [33]

    Exploring vq-vae with prosody parameters for speaker anonymization

    Sotheara Leang, Anderson Augusma, Eric Castelli, Frédérique Letué, Sethserey Sam, and Dominique Vaufreydaz. Exploring vq-vae with prosody parameters for speaker anonymization. arXiv preprint arXiv:2409.15882, 2024

  34. [34]

    Mer 2023: Multi-label learning, modality robustness, and semi-supervised learning

    Zheng Lian, Haiyang Sun, Licai Sun, Kang Chen, Mingyu Xu, Kexin Wang, Ke Xu, Yu He, Ying Li, Jinming Zhao, et al. Mer 2023: Multi-label learning, modality robustness, and semi-supervised learning. In Proceedings of the 31st ACM International Conference on Multimedia, pages 9610–9614, 2023

  35. [35]

    Group level audio-video emotion recognition using hybrid networks

    Chuanhe Liu, Wenqiang Jiang, Minghao Wang, and Tianhao Tang. Group level audio-video emotion recognition using hybrid networks. In Proceedings of the 2020 International Conference on Multimodal Interaction (ICMI 2020), pages 807–812. Association for Computing Machinery, 2020. ISBN 9781450375818. doi:10.1145/3382507.3417968

  36. [36]

    All rivers run into the sea: Unified modality brain-inspired emotional central mechanism

    Xinji Mai, Junxiong Lin, Haoran Wang, Zeng Tao, Yan Wang, Shaoqi Yan, Xuan Tong, Jiawen Yu, Boyang Wang, Ziheng Zhou, et al. All rivers run into the sea: Unified modality brain-inspired emotional central mechanism. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 632–641, 2024

  37. [37]

    Ous: Scene-guided dynamic facial expression recognition

    Xinji Mai, Haoran Wang, Zeng Tao, Junxiong Lin, Shaoqi Yan, Yan Wang, Jing Liu, Jiawen Yu, Xuan Tong, Yating Li, et al. Ous: Scene-guided dynamic facial expression recognition. CoRR, 2024

  38. [38]

    Facial landmark-based emotion recognition via directed graph neural network

    Quang Tran Ngoc, Seunghyun Lee, and Byung Cheol Song. Facial landmark-based emotion recognition via directed graph neural network. Electronics, 9(5):764, 2020

  39. [39]

    Attentive statistics pooling for deep speaker embedding

    Koji Okabe, Takafumi Koshinaka, and Koichi Shinoda. Attentive statistics pooling for deep speaker embedding. arXiv preprint arXiv:1803.10963, 2018

  40. [40]

    Real-time 2d multi-person pose estimation on cpu: Lightweight openpose

    D Osokin. Real-time 2d multi-person pose estimation on cpu: Lightweight openpose. In ICPRAM 2019 - Proceedings of the 8th International Conference on Pattern Recognition Applications and Methods, pages 744–748, 2019

  41. [41]

    Privacy-preserving video classification with convolutional neural networks

    Sikha Pentyala, Rafael Dowsley, and Martine De Cock. Privacy-preserving video classification with convolutional neural networks. In International Conference on Machine Learning, pages 8487–8499. PMLR, 2021

  42. [42]

    Multitask, multilabel, and multidomain learning with convolutional networks for emotion recognition

    Gerard Pons and David Masip. Multitask, multilabel, and multidomain learning with convolutional networks for emotion recognition. IEEE Transactions on Cybernetics, 52(6):4764–4771, 2020

  43. [43]

    Robust speech recognition via large-scale weak supervision

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, pages 28492–28518. PMLR, 2023

  44. [44]

    Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition

    Rajeev Ranjan, Vishal M Patel, and Rama Chellappa. Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. IEEE transactions on pattern analysis and machine intelligence, 41(1):121–135, 2017

  45. [45]

    U-net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015

  46. [46]

    Neural network model for video-based analysis of student's emotions in e-learning

    Andrey V Savchenko and IA Makarov. Neural network model for video-based analysis of student's emotions in e-learning. Optical Memory and Neural Networks, 31(3):237–244, 2022

  47. [47]

    Audio-visual automatic group affect analysis

    Garima Sharma, Abhinav Dhall, and Jianfei Cai. Audio-visual automatic group affect analysis. IEEE Transactions on Affective Computing, 2021

  48. [48]

    End-to-end multi-person pose estimation with transformers

    Dahu Shi, Xing Wei, Liangqi Li, Ye Ren, and Wenming Tan. End-to-end multi-person pose estimation with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11069–11078, 2022

  49. [49]

    Do i have your attention: A large scale engagement prediction dataset and baselines

    Monisha Singh, Ximi Hoque, Donghuo Zeng, Yanan Wang, Kazushi Ikeda, and Abhinav Dhall. Do i have your attention: A large scale engagement prediction dataset and baselines. In Proceedings of the 25th International Conference on Multimodal Interaction , pages 174–182, 2023

  50. [50]

    Multi-modal fusion using spatio-temporal and static features for group emotion recognition

    Mo Sun, Jian Li, Hui Feng, Wei Gou, Haifeng Shen, Jian Tang, Yi Yang, and Jieping Ye. Multi-modal fusion using spatio-temporal and static features for group emotion recognition. In Proceedings of the 2020 International Conference on Multimodal Interaction (ICMI 2020), pages 835–840, 2020. doi:10.1145/3382507.3417971

  51. [51]

    Group emotion recognition with individual facial emotion cnns and global image based cnns

    Lianzhi Tan, Kaipeng Zhang, Kai Wang, Xiaoxing Zeng, Xiaojiang Peng, and Yu Qiao. Group emotion recognition with individual facial emotion cnns and global image based cnns. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, pages 549–552, 2017

  52. [52]

    Align-dfer: Pioneering comprehensive dynamic affective alignment for dynamic facial expression recognition with clip

    Zeng Tao, Yan Wang, Junxiong Lin, Haoran Wang, Xinji Mai, Jiawen Yu, Xuan Tong, Ziheng Zhou, Shaoqi Yan, Qing Zhao, et al. Align-dfer: Pioneering comprehensive dynamic affective alignment for dynamic facial expression recognition with clip. CoRR, 2024

  53. [53]

    Tcct-net: Two-stream network architecture for fast and efficient engagement estimation via behavioral feature signals

    Alexander Vedernikov, Puneet Kumar, Haoyu Chen, Tapio Seppänen, and Xiaobai Li. Tcct-net: Two-stream network architecture for fast and efficient engagement estimation via behavioral feature signals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 4723–4732, June 2024

  54. [54]

    Social signal processing: Survey of an emerging domain

    Alessandro Vinciarelli, Maja Pantic, and Hervé Bourlard. Social signal processing: Survey of an emerging domain. Image and vision computing, 27(12):1743–1759, 2009

  55. [55]

    Hierarchical audio-visual information fusion with multi-label joint decoding for mer 2023

    Haotian Wang, Yuxuan Xi, Hang Chen, Jun Du, Yan Song, Qing Wang, Hengshun Zhou, Chenxi Wang, Jiefeng Ma, Pengfei Hu, et al. Hierarchical audio-visual information fusion with multi-label joint decoding for mer 2023. In Proceedings of the 31st ACM International Conference on Multimedia, pages 9531–9535, 2023

  56. [56]

    Improving multi-modal emotion recognition using entropy-based fusion and pruning-based network architecture optimization

    Haotian Wang, Jun Du, Yusheng Dai, Chin-Hui Lee, Yuling Ren, and Yu Liu. Improving multi-modal emotion recognition using entropy-based fusion and pruning-based network architecture optimization. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 11766–11770. IEEE, 2024

  57. [57]

    Cascade attention networks for group emotion recognition with face, body and image cues

    Kai Wang, Xiaoxing Zeng, Jianfei Yang, Debin Meng, Kaipeng Zhang, Xiaojiang Peng, and Yu Qiao. Cascade attention networks for group emotion recognition with face, body and image cues. In Proceedings of the 20th ACM International Conference on Multimodal Interaction, pages 640–645, 2018

  58. [58]

    A joint local spatial and global temporal cnn-transformer for dynamic facial expression recognition

    Linhuang Wang, Xin Kang, Fei Ding, Satoshi Nakagawa, and Fuji Ren. A joint local spatial and global temporal cnn-transformer for dynamic facial expression recognition. Applied Soft Computing, 161:111680, 2024

  59. [59]

    Skeleton-based st-gcn for human action recognition with extended skeleton graph and partitioning strategy

    Quanyu Wang, Kaixiang Zhang, and Manjotho Ali Asghar. Skeleton-based st-gcn for human action recognition with extended skeleton graph and partitioning strategy. IEEE Access, 10:41403–41410, 2022

  60. [60]

    Vitpose: Simple vision transformer baselines for human pose estimation

    Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao. Vitpose: Simple vision transformer baselines for human pose estimation. Advances in Neural Information Processing Systems, 35:38571–38584, 2022

  61. [61]

    Multi-clue fusion for emotion recognition in the wild

    Jingwei Yan, Wenming Zheng, Zhen Cui, Chuangao Tang, Tong Zhang, Yuan Zong, and Ning Sun. Multi-clue fusion for emotion recognition in the wild. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, pages 458–463, 2016

  62. [62]

    Npu-ntu system for voice privacy 2024 challenge

    Jixun Yao, Nikita Kuzmin, Qing Wang, Pengcheng Guo, Ziqian Ning, Dake Guo, Kong Aik Lee, Eng-Siong Chng, and Lei Xie. Npu-ntu system for voice privacy 2024 challenge. arXiv preprint arXiv:2409.04173, 2024

  63. [63]

    Multi-task convolutional neural network for pose-invariant face recognition

    Xi Yin and Xiaoming Liu. Multi-task convolutional neural network for pose-invariant face recognition. IEEE Transactions on Image Processing, 27(2):964–975, 2017

  64. [64]

    LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop

    Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. CoRR, abs/1506.03365, 2015. URL http://arxiv.org/abs/1506.03365

  65. [65]

    Two-dimensional human pose estimation with deep learning: A review

    Zheyu Zhang and Seong-Yoon Shin. Two-dimensional human pose estimation with deep learning: A review. Applied Sciences, 15(13):7344, 2025

  66. [66]

    Adaptive key role guided hierarchical relation inference for enhanced group-level emotion recognition

    Qing Zhu, Qirong Mao, Wenlong Dong, Xiuyan Shao, Xiaohua Huang, and Wenming Zheng. Adaptive key role guided hierarchical relation inference for enhanced group-level emotion recognition. IEEE Transactions on Affective Computing, 2025

  67. [67]

    Privacy aware affective state recognition from visual data

    M Sami Zitouni, Peter Lee, Uichin Lee, Leontios J Hadjileontiadis, and Ahsan Khandoker. Privacy aware affective state recognition from visual data. IEEE Access, 10:40620–40628, 2022

  68. [68]

    Building robust multimodal sentiment recognition via a simple yet effective multimodal transformer

    Daoming Zong, Chaoyue Ding, Baoxiang Li, Dinghao Zhou, Jiakui Li, Ken Zheng, and Qunyan Zhou. Building robust multimodal sentiment recognition via a simple yet effective multimodal transformer. In Proceedings of the 31st ACM International Conference on Multimedia, pages 9596–9600, 2023