Ordering Matters: Rank-Aware Selective Fusion for Blended Emotion Recognition

Hanna Jang; Hyunseo Kim; Junghyun Lee; Junhyug Noh

arxiv: 2605.21417 · v2 · pith:UT3ABL43new · submitted 2026-05-20 · 💻 cs.CV · cs.AI


Pith reviewed 2026-06-30 17:03 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords blended emotion recognition · selective fusion · multi-encoder · attention gating · multimodal · domain adaptation · emotion detection


The pith

A rank-aware method that fuses only the top-n most useful encoders per sample improves detection of blended emotions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that blended emotions, which mix several feelings at once, can be recognized more accurately by drawing on many pre-trained video and audio encoders but using only the most relevant ones for each input. It projects the different encoder outputs into a common space, applies attention to rank their importance sample by sample, and combines just the top few before making separate predictions for emotion presence and strength. This selective step, plus a simple domain adaptation trick, is meant to reduce noise from unhelpful encoders while preserving the overlapping cues that define blended states. If the approach holds, systems for real-world emotion understanding become less dependent on any single encoder and more robust when data distributions shift.

Core claim

The authors establish that projecting heterogeneous encoder features into a shared latent space, estimating sample-wise importance via an attention-based gating module, and fusing only the top-n encoders, while decoupling predictions into presence and salience heads and applying feature-level unsupervised domain adaptation, yields higher accuracy than either single encoders or full naive fusion on the BlEmoRE challenge and secures second place in the competition.

What carries the argument

The attention-based gating module that ranks encoders by estimated importance and performs selective top-n fusion.
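The ranking-and-selection step can be illustrated for a single sample as follows. This is a minimal sketch, not the authors' implementation: it assumes the gate produces one raw score per encoder, that a softmax turns scores into importance weights, and that the surviving weights are renormalised before a weighted sum. The names `top_n_fuse` and `gate_scores` are hypothetical, and the sketch keeps the embedding dimension unchanged rather than reproducing the 256-d to 512-d mapping shown in Figure 1.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_n_fuse(embeddings, gate_scores, n):
    """Fuse the n highest-weighted encoder embeddings for one sample.

    embeddings  : list of K vectors, already projected to a common dimension
    gate_scores : list of K raw scores from a (hypothetical) gating module
    n           : number of encoders to keep
    """
    weights = softmax(gate_scores)
    # Rank encoder indices by importance, highest first, and keep the top n.
    ranked = sorted(range(len(weights)), key=lambda k: weights[k], reverse=True)
    kept = ranked[:n]
    # Assumption: renormalise the surviving weights so they sum to 1.
    mass = sum(weights[k] for k in kept)
    dim = len(embeddings[0])
    fused = [0.0] * dim
    for k in kept:
        w = weights[k] / mass
        for d in range(dim):
            fused[d] += w * embeddings[k][d]
    return fused, kept

# Three toy 2-d embeddings; the second encoder gets a low score and is dropped.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, kept = top_n_fuse(embeddings, gate_scores=[2.0, 0.0, 2.0], n=2)
```

Because selection happens per sample, two inputs can end up fusing different encoder subsets even though the gate parameters are shared.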

If this is right

Selective fusion outperforms both strong single encoders and naive multi-encoder baselines.
Decoupling the task into separate presence and salience heads supports finer modeling of blended states.
Unsupervised domain adaptation at the feature level increases robustness to distribution shifts.
The complete system achieved second place in the BlEmoRE competition.
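To make the presence/salience split concrete, the sketch below shows one way two heads could be combined at the probability level. The paper does not spell out its fusion rule in the text reproduced here, so everything specific is an assumption: presence is modelled as an independent sigmoid per emotion, salience as a softmax over emotions, and the fusion rule (multiply, then renormalise) is chosen only for illustration. The function name `fuse_heads` and the 0.5 threshold are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_heads(presence_logits, salience_logits, threshold=0.5):
    """Combine a presence head and a salience head for one sample.

    Returns the indices of emotions judged present and a relative-strength
    distribution over all emotions.
    """
    presence = [sigmoid(z) for z in presence_logits]   # is each emotion there?
    salience = softmax(salience_logits)                # how strong, relatively?
    # Assumed fusion rule: down-weight salience for emotions unlikely to be present.
    gated = [p * s for p, s in zip(presence, salience)]
    total = sum(gated)
    shares = [g / total for g in gated]
    present = [i for i, p in enumerate(presence) if p >= threshold]
    return present, shares

# Three emotions: the first two are likely present, the third is not.
present, shares = fuse_heads([3.0, 2.0, -4.0], [1.0, 0.0, 0.0])
```

The point of the split is that an emotion can be present yet weak: the presence head answers the first question and the salience head the second, so neither has to encode both.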

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same per-sample ranking idea could be tested on other multimodal tasks where some encoders are occasionally uninformative.
Full fusion of all encoders may add noise rather than signal when emotions are expressed as mixtures.
Varying the number n across datasets would show whether an optimal selection size exists for different emotion granularities.

Load-bearing premise

The attention gating module can correctly identify which encoders carry the information needed for each sample without discarding cues required to detect overlapping emotions.

What would settle it

Replace the top-n selection with fusion of every available encoder on the BlEmoRE test set and measure whether accuracy falls below the reported selective-fusion result.
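The comparison described above amounts to sweeping n up to the full encoder count and scoring each setting on the same samples. The harness below is a sketch under stated assumptions: `predict(sample, n)` is a hypothetical callable standing in for the trained system, and the toy votes are synthetic values constructed only to exercise the code. They say nothing about BlEmoRE results.

```python
def accuracy_by_n(samples, labels, predict, n_values):
    """Return {n: accuracy} for a predictor that takes (sample, n)."""
    results = {}
    for n in n_values:
        correct = sum(1 for x, y in zip(samples, labels) if predict(x, n) == y)
        results[n] = correct / len(samples)
    return results

def toy_predict(votes, n):
    """Toy stand-in: majority vote over the first n per-encoder binary votes.

    `votes` is assumed to be ordered from most to least important encoder.
    """
    kept = votes[:n]
    return 1 if sum(kept) * 2 > len(kept) else 0

# Synthetic data: five encoders per sample, binary labels.
samples = [[1, 1, 0, 0, 0], [0, 0, 1, 1, 1], [1, 0, 1, 0, 0]]
labels = [1, 0, 1]
results = accuracy_by_n(samples, labels, toy_predict, n_values=[2, 3, 5])
```

Setting n to the total number of encoders gives the full-fusion condition, so the same loop covers both the selective and the naive baseline.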

Figures

Figures reproduced from arXiv: 2605.21417 by Hanna Jang, Hyunseo Kim, Junghyun Lee, Junhyug Noh.

**Figure 1.** Overview of the proposed framework. Heterogeneous encoder features are first projected into a shared 256-d embedding space. An attention-based gating module estimates sample-wise encoder importance, after which only the top-n encoders are retained for weighted fusion into a 512-d shared representation. Two prediction heads model emotion presence and salience, and their outputs are aligned through probabili…

**Figure 2.** Effect of the number of selected encoders

**Figure 3.** Distribution of modality-group importance scores across samples.

**Figure 4.** Top-n selection frequency for each encoder. A small subset of encoders is selected in most samples, while many others are used much less frequently. The gradually decaying distribution indicates that encoder usefulness is highly uneven, supporting the need for ranking-based selective fusion.
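A selection-frequency statistic of the kind plotted in Figure 4 can be computed by counting how often each encoder index appears among the kept encoders. The sketch below assumes the per-sample selections are available as lists of encoder indices; the function name `selection_frequency` and the toy selections are hypothetical.

```python
from collections import Counter

def selection_frequency(selected_per_sample, num_encoders):
    """Fraction of samples in which each encoder was among the top-n.

    selected_per_sample : list of lists of encoder indices, one list per sample
    num_encoders        : total number of encoders K
    """
    counts = Counter(k for kept in selected_per_sample for k in kept)
    total = len(selected_per_sample)
    # Counter returns 0 for encoders that were never selected.
    return [counts[k] / total for k in range(num_encoders)]

# Four toy samples with n = 2 out of K = 5 encoders.
freq = selection_frequency([[0, 2], [0, 1], [0, 2], [2, 3]], num_encoders=5)
```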

**Figure 5.** Mean encoder importance across folds. High-importance encoders remain consistently dominant across different folds, while low-importance…

**Figure 6.** Distribution of importance weights for representative encoders.

**Figure 7.** Pairwise Linear CKA similarity between projected encoder…
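For readers unfamiliar with the similarity measure named in Figure 7, linear CKA between two feature matrices with the same number of rows (samples) is the squared Frobenius norm of the cross-covariance divided by the product of the Frobenius norms of the two self-covariances, all on column-centered features. The pure-Python sketch below follows that standard definition; how the paper batches or averages the computation is not stated here, and the helper names are hypothetical.

```python
import math

def center_columns(m):
    """Subtract each column's mean so every feature has zero mean."""
    rows, cols = len(m), len(m[0])
    means = [sum(m[i][j] for i in range(rows)) / rows for j in range(cols)]
    return [[m[i][j] - means[j] for j in range(cols)] for i in range(rows)]

def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for matrices with equal row counts."""
    total = 0.0
    for p in range(len(a[0])):
        for q in range(len(b[0])):
            dot = sum(a[i][p] * b[i][q] for i in range(len(a)))
            total += dot * dot
    return total

def linear_cka(x, y):
    """Linear CKA between x (samples x d1) and y (samples x d2)."""
    x, y = center_columns(x), center_columns(y)
    num = cross_frobenius_sq(y, x)
    den = math.sqrt(cross_frobenius_sq(x, x)) * math.sqrt(cross_frobenius_sq(y, y))
    return num / den  # undefined if either matrix has no variance

x = [[1.0, 2.0], [3.0, 4.0], [5.0, 7.0]]
scaled = [[3.0 * v for v in row] for row in x]
same = linear_cka(x, scaled)                  # invariant to isotropic scaling
unrelated = linear_cka([[0.0], [1.0], [2.0], [3.0]],
                       [[1.0], [-1.0], [-1.0], [1.0]])
```

A value near 1 suggests two encoders carry largely redundant information after projection, while a value near 0 suggests they are complementary.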

Original abstract

Blended emotion recognition is challenging because emotions are often expressed as mixtures of subtle and overlapping multimodal cues rather than a single dominant signal. We propose a rank-aware multi-encoder framework that selectively combines complementary representations from diverse pre-extracted video and audio encoders. Our method projects heterogeneous encoder features into a shared latent space, estimates sample-wise encoder importance through an attention-based gating module, and fuses only the top-n most informative encoders. To better model blended emotions, we decouple prediction into presence and salience heads and align them through probability-level fusion. We further incorporate feature-level unsupervised domain adaptation without pseudo-labeling to improve robustness under distribution shift. Experiments on the BlEmoRE challenge show that the proposed framework outperforms strong individual encoders and na\"ive multi-encoder fusion baselines. Our final system ranked 2nd in the competition, supporting the effectiveness of rank-aware selective fusion for fine-grained blended emotion recognition.
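The abstract does not say which feature-level adaptation objective is used, only that it is unsupervised and avoids pseudo-labels. As a stand-in for the general idea, the sketch below computes a linear-kernel maximum mean discrepancy: the squared distance between the mean source feature and the mean target feature. This is an assumption chosen for simplicity, not the paper's loss, and `linear_mmd` is a hypothetical name. It needs no target labels, which is the property the abstract emphasises.

```python
def mean_vector(batch):
    """Per-dimension mean of a batch of feature vectors."""
    n = len(batch)
    return [sum(row[d] for row in batch) / n for d in range(len(batch[0]))]

def linear_mmd(source_feats, target_feats):
    """Squared distance between mean source and mean target features.

    Minimising this alongside the task loss pulls the two feature
    distributions together without using any target labels.
    """
    mu_s = mean_vector(source_feats)
    mu_t = mean_vector(target_feats)
    return sum((a - b) ** 2 for a, b in zip(mu_s, mu_t))

source = [[0.0, 0.0], [2.0, 2.0]]   # toy labelled-domain features
target = [[1.0, 3.0], [3.0, 5.0]]   # toy unlabelled-domain features
gap = linear_mmd(source, target)
no_gap = linear_mmd(source, source)
```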

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Rank-aware selective fusion lands 2nd in the BlEmoRE challenge through per-sample encoder picking and decoupled heads, but the gating step lacks evidence it keeps the low-salience cues that define blends.


The paper's main contribution is a practical pipeline that projects features from multiple video and audio encoders into one space, uses attention to rank and keep only the top-n per sample, then fuses at probability level after splitting into presence and salience heads plus some unsupervised adaptation. This setup beat single-encoder baselines and naive fusion on the challenge data and placed second overall.

The selective ranking and the presence/salience split are the clearest engineering moves. Blended emotions are mixtures, so forcing the model to decide both whether a cue is present and how strong it is avoids treating everything as a single dominant label. The competition result gives a concrete external check that the whole thing works better than the obvious alternatives.

The soft spot is the gating itself. The claim that attention reliably surfaces the right subset rests on the assumption that lower-ranked encoders can be dropped without losing information needed for subtle overlaps. No ablation tests whether forcing in the (n+1)-th encoder changes the output on blended cases, and there is no direct measure of information loss from the selection step. The top-n threshold is also a free parameter whose sensitivity is not explored beyond the final score. These gaps make it hard to attribute the gain specifically to the rank-aware selection rather than to better overall alignment or the adaptation trick.

The work is aimed at teams building multimodal systems for affective computing or running similar challenges. It has a real result and a coherent method, so it deserves peer review; a referee can ask for the missing gating ablations and parameter checks without needing to rewrite the core idea.

Referee Report

2 major / 1 minor

Summary. The paper proposes a rank-aware multi-encoder framework for blended emotion recognition. Heterogeneous video and audio encoder features are projected into a shared latent space; an attention-based gating module estimates per-sample encoder importance and fuses only the top-n encoders. Prediction is decoupled into presence and salience heads whose outputs are aligned via probability-level fusion; unsupervised domain adaptation without pseudo-labeling is added for robustness. Experiments on the BlEmoRE challenge are reported to show outperformance over individual encoders and naïve multi-encoder baselines, with the system placing 2nd in the competition.

Significance. If the empirical claims are substantiated with quantitative results, the work would demonstrate that per-sample rank-aware selection can improve fusion for subtle, overlapping multimodal emotion cues relative to full or naïve fusion, with potential value for fine-grained affective computing tasks.

major comments (2)

Abstract: the central claim that the framework 'outperforms strong individual encoders and naïve multi-encoder fusion baselines' and 'ranked 2nd' is stated without any numerical scores, baseline definitions, statistical tests, ablation tables, or result numbers, so the effectiveness of rank-aware selective fusion cannot be evaluated from the manuscript text.
Method description (gating module): no analysis, ablation, or diagnostic is supplied showing that the attention-based top-n selection reliably retains low-salience but necessary cues when emotions are blended mixtures rather than dominant signals; this directly bears on whether the subsequent presence/salience fusion operates on complete representations.

minor comments (1)

Abstract: 'na"ive' should be rendered as 'naïve' for typographic consistency.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate where revisions will be made to the manuscript.


Referee: Abstract: the central claim that the framework 'outperforms strong individual encoders and naïve multi-encoder fusion baselines' and 'ranked 2nd' is stated without any numerical scores, baseline definitions, statistical tests, ablation tables, or result numbers, so the effectiveness of rank-aware selective fusion cannot be evaluated from the manuscript text.

Authors: We agree that the abstract would be strengthened by including key quantitative results. The full manuscript contains tables with performance metrics against individual encoders and naïve fusion baselines, plus the competition ranking. In revision we will add concise numerical highlights to the abstract (e.g., relative gains and ranking) within length constraints to make the claims directly evaluable from the abstract text. revision: yes

Referee: Method description (gating module): no analysis, ablation, or diagnostic is supplied showing that the attention-based top-n selection reliably retains low-salience but necessary cues when emotions are blended mixtures rather than dominant signals; this directly bears on whether the subsequent presence/salience fusion operates on complete representations.

Authors: This is a fair observation. While the overall empirical results support the selective fusion approach, the manuscript does not provide a dedicated diagnostic or ablation focused on retention of low-salience cues specifically in blended-emotion cases. We will add such an analysis (e.g., case studies or contribution metrics for lower-ranked encoders on mixed-emotion samples) in the revised version to directly address this concern. revision: yes
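One inexpensive contribution metric for lower-ranked encoders is the share of gate weight that top-n selection throws away on each sample. The sketch below is a hedged illustration of that idea, not an analysis from the manuscript: it assumes softmax gate weights, and `dropped_weight_mass` is a hypothetical name. Weight mass is only a proxy, since a low-weight encoder could still carry a cue that matters for a blended sample.

```python
import math

def dropped_weight_mass(gate_scores, n):
    """Total softmax weight assigned to encoders outside the top n."""
    m = max(gate_scores)
    exps = [math.exp(s - m) for s in gate_scores]
    total = sum(exps)
    weights = sorted((e / total for e in exps), reverse=True)
    # Everything after the first n entries is discarded by selection.
    return sum(weights[n:])

# A flat gate over four encoders discards half the mass when n = 2.
flat = dropped_weight_mass([0.0, 0.0, 0.0, 0.0], n=2)
# Keeping every encoder discards nothing.
none_dropped = dropped_weight_mass([1.0, 2.0, 3.0], n=3)
```

Comparing this quantity between single-emotion and mixed-emotion samples would show whether blends systematically lose more gate mass to selection.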

Circularity Check

0 steps flagged

No circularity detected in derivation chain


The paper presents an empirical framework for blended emotion recognition using attention-based gating for encoder selection, decoupled presence/salience heads, probability-level fusion, and unsupervised domain adaptation. No equations, fitted parameters, or self-citations are shown that would reduce any prediction or claim to an input by construction. Performance is evaluated via external competition results on BlEmoRE, which are independent of the method's internal definitions. The derivation chain is self-contained as a set of architectural choices validated by experiment rather than tautological.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on standard deep-learning assumptions about feature complementarity and attention mechanisms. The abstract names no invented entities and gives no value for its one free parameter, the top-n threshold.

free parameters (1)

top-n threshold
The number of encoders retained after gating is a tunable hyperparameter whose value is not stated.

axioms (1)

domain assumption: Pre-extracted encoder features from diverse models are complementary for blended emotion cues.
Invoked by the selective fusion design.



