arXiv: 2602.16161 · v3 · submitted 2026-02-18 · cs.MM · cs.CL · cs.LG

Emotion Collider: Dual Hyperbolic Mirror Manifolds for Sentiment Recovery via Anti Emotion Reflection

Rong Fu, Ziming Wang, Shuo Yin, Haiyun Wei, Kun Liu, Xianda Li, Zeli Su, Simon Fong

Pith reviewed 2026-05-15 21:36 UTC · model grok-4.3

keywords: multimodal emotion recognition · hyperbolic embeddings · hypergraph neural networks · Poincare ball · contrastive learning · sentiment analysis · affective computing

The pith

Hyperbolic hypergraphs with Poincare-ball embeddings recover multimodal emotions more accurately than Euclidean baselines, especially with missing or noisy data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Emotion Collider (EC-Net), a framework that places modality hierarchies into Poincare-ball embeddings to reflect their natural tree-like structure in emotion data. It then fuses the modalities with a hypergraph that passes messages bidirectionally between nodes and hyperedges, keeping higher-order relations across time and channels intact. Contrastive learning is performed directly in hyperbolic space by separating radial and angular losses to pull same-emotion samples closer and push others apart. The result is more resilient representations that maintain accuracy on standard benchmarks even when one or more input streams are absent or corrupted. Readers care because everyday human-computer interfaces rarely receive clean, complete signals from all sensors at once.
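To make the geometric intuition concrete, here is a minimal sketch of the standard geodesic distance in the unit Poincare ball (curvature -1). It is not the paper's implementation; the function name `poincare_distance` and the sample points are illustrative assumptions. It shows that the same Euclidean gap costs far more hyperbolic distance near the boundary than near the origin, which is what gives the ball room for tree-like hierarchies.

```python
import math

def poincare_distance(x, y):
    """Geodesic distance between two points of the unit Poincare ball (curvature -1)."""
    sq_norm_x = sum(a * a for a in x)
    sq_norm_y = sum(b * b for b in y)
    # Both points must lie strictly inside the unit ball.
    if sq_norm_x >= 1.0 or sq_norm_y >= 1.0:
        raise ValueError("points must have Euclidean norm < 1")
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_x) * (1.0 - sq_norm_y))
    return math.acosh(arg)

# Two pairs with the same Euclidean gap (0.1) at different radii (hypothetical points).
near_origin = poincare_distance([0.0, 0.0], [0.1, 0.0])
near_boundary = poincare_distance([0.85, 0.0], [0.95, 0.0])
print(near_origin, near_boundary)
```

With these points the pair near the boundary is several times farther apart (roughly 1.15 versus 0.20), even though both pairs are 0.1 apart in Euclidean terms.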

Core claim

Emotion Collider represents modality hierarchies using Poincare-ball embeddings and performs fusion through a hypergraph mechanism that passes messages bidirectionally between nodes and hyperedges. To sharpen class separation, contrastive learning is formulated in hyperbolic space with decoupled radial and angular objectives. High-order semantic relations across time steps and modalities are preserved via adaptive hyperedge construction, producing robust representations that improve accuracy on multimodal emotion benchmarks particularly when modalities are partially available or contaminated by noise.

What carries the argument

Poincare-ball embeddings for hierarchical modality geometry combined with bidirectional hypergraph message passing and adaptive hyperedge construction for cross-modal fusion.
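The bidirectional passing between nodes and hyperedges can be sketched as one round of node-to-hyperedge and hyperedge-to-node aggregation. The sketch below is a deliberately simplified assumption: it uses plain Euclidean means instead of hyperbolic operations, fixed rather than adaptive hyperedges, and no learned weights. The names `hypergraph_round`, `nodes`, and `edges` are hypothetical.

```python
def mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def hypergraph_round(node_feats, hyperedges):
    """One node -> hyperedge -> node round with mean aggregation.

    node_feats: dict mapping node id -> feature vector (list of floats)
    hyperedges: dict mapping hyperedge id -> list of member node ids
    """
    # Step 1 (node -> hyperedge): each hyperedge averages its member nodes.
    edge_feats = {
        e: mean([node_feats[v] for v in members])
        for e, members in hyperedges.items()
    }
    # Step 2 (hyperedge -> node): each node averages its incident hyperedges.
    updated = {}
    for v, feat in node_feats.items():
        incident = [edge_feats[e] for e, members in hyperedges.items() if v in members]
        # Nodes that belong to no hyperedge keep their features unchanged.
        updated[v] = mean(incident) if incident else list(feat)
    return updated, edge_feats

# Hypothetical toy graph: three modalities at time 0, plus text at time 1.
nodes = {
    "text_t0": [1.0, 0.0],
    "audio_t0": [0.0, 1.0],
    "video_t0": [1.0, 1.0],
    "text_t1": [2.0, 0.0],
}
edges = {
    "cross_modal_t0": ["text_t0", "audio_t0", "video_t0"],  # one hyperedge, three modalities
    "text_over_time": ["text_t0", "text_t1"],               # one hyperedge across time steps
}
new_nodes, edge_feats = hypergraph_round(nodes, edges)
```

A single hyperedge joins all three modalities at once, which is the higher-order relation a pairwise graph would have to approximate with three separate edges.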

Load-bearing premise

That Poincare-ball embeddings plus bidirectional hypergraph message passing will preserve high-order semantic relations across time steps and modalities better than existing Euclidean or graph baselines.

What would settle it

A controlled experiment on a multimodal benchmark where one modality is progressively removed or replaced with Gaussian noise; if EC-Net accuracy falls to or below the Euclidean or graph baseline, the central claim is falsified.
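The experiment described above could be scripted along the following lines. This is a minimal sketch under stated assumptions: samples are dictionaries from modality name to a flat feature list, "removal" is modelled as zero-filling, and `predict` stands in for any trained model. None of these names or conventions come from the paper.

```python
import random

def corrupt_modality(sample, modality, mode, noise_std=1.0, rng=None):
    """Return a copy of `sample` with one modality dropped or noised.

    sample: dict mapping modality name -> list of floats
    mode: "drop" zero-fills the modality, "noise" adds Gaussian noise to it.
    """
    if rng is None:
        rng = random.Random(0)
    out = {name: list(values) for name, values in sample.items()}
    if mode == "drop":
        out[modality] = [0.0] * len(out[modality])
    elif mode == "noise":
        out[modality] = [v + rng.gauss(0.0, noise_std) for v in out[modality]]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return out

def accuracy_under_corruption(predict, dataset, modality, mode, noise_std=1.0, seed=0):
    """Accuracy of `predict` when `modality` is corrupted in every sample."""
    rng = random.Random(seed)
    correct = 0
    for sample, label in dataset:
        corrupted = corrupt_modality(sample, modality, mode, noise_std, rng)
        correct += int(predict(corrupted) == label)
    return correct / len(dataset)

# Hypothetical stand-in model that looks only at the text features.
def toy_predict(sample):
    return int(sum(sample["text"]) > 0)

toy_data = [
    ({"text": [1.0, 2.0], "audio": [0.5]}, 1),
    ({"text": [-1.0, -2.0], "audio": [0.3]}, 0),
]
acc_drop_audio = accuracy_under_corruption(toy_predict, toy_data, "audio", "drop")
acc_drop_text = accuracy_under_corruption(toy_predict, toy_data, "text", "drop")
```

Running the same function over increasing `noise_std` values for both EC-Net and a Euclidean or graph baseline would give the two accuracy curves whose crossing, or lack of one, decides the claim.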

Figures

Figures reproduced from arXiv: 2602.16161 by Haiyun Wei, Kun Liu, Rong Fu, Shuo Yin, Simon Fong, Xianda Li, Zeli Su, Ziming Wang.

**Figure 2.** Radar summary across six missing patterns and three metrics (Acc2 / F1 / MAE). EC-Net shows consistent …

**Figure 3.** Training trajectories for principal losses (mean across three seeds). Task loss, reconstruction loss, property …

**Figure 4.** Stacked bar plot showing Acc2 drops for each ablation across FIX and MR regimes.

**Figure 5.** Histogram of principal angles θ(Σ, µ) after training (50 bins). The distribution concentrates near small angles (mean ≈ 3.8°).

**Figure 6.** Single-factor hyperparameter scans showing Acc2 versus the swept factor. The default operating point is …

**Figure 7.** Heatmap of Acc2 as a function of curvature ratio and orthogonality penalty, revealing a stable plateau.

**Figure 8.** Training curve under 8 random seeds (shaded area = min/max envelope). The small envelope confirms stable …

**Figure 9.** Two test samples with the highest geometric-asymmetry score …

**Figure 10.** Poincaré-disk scatter of randomly sampled emotion embeddings (1,000 points) colored by label; superim…

**Figure 11.** Mirror-space t-SNE: left original hE, right mapped fψ(gϕ(hE)); gray lines connect corresponding points. Small cycle distances indicate good involution behaviour.

**Figure 12.** Corruption robustness: bars show performance under clean, light and heavy corruption conditions for several …

**Figure 13.** Peak GPU memory vs. Acc2 Pareto plot across batch sizes. EC-Net occupies a favourable memory-accuracy …

**Figure 14.** Training curve with mean and 95% confidence band from three seeds. Low variance indicates stable training …

Original abstract

Emotional expression underpins natural communication and effective human-computer interaction. We present Emotion Collider (EC-Net), a hyperbolic hypergraph framework for multimodal emotion and sentiment modeling. EC-Net represents modality hierarchies using Poincare-ball embeddings and performs fusion through a hypergraph mechanism that passes messages bidirectionally between nodes and hyperedges. To sharpen class separation, contrastive learning is formulated in hyperbolic space with decoupled radial and angular objectives. High-order semantic relations across time steps and modalities are preserved via adaptive hyperedge construction. Empirical results on standard multimodal emotion benchmarks show that EC-Net produces robust, semantically coherent representations and consistently improves accuracy, particularly when modalities are partially available or contaminated by noise. These findings indicate that explicit hierarchical geometry combined with hypergraph fusion is effective for resilient multimodal affect understanding.
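The abstract does not give the form of the decoupled radial and angular objectives, so the following is only an illustrative sketch of what such a split can look like. It assumes the unit Poincare ball with curvature -1, takes the radial coordinate to be the hyperbolic distance from the origin, and uses a hypothetical pair loss with a hinge on the angular term. The function names, weights, and margin are all assumptions.

```python
import math

def radial_angular(x, eps=1e-12):
    """Split a point of the unit Poincare ball into (hyperbolic radius, unit direction)."""
    norm = math.sqrt(sum(a * a for a in x))
    if norm >= 1.0:
        raise ValueError("point must have Euclidean norm < 1")
    radius = 2.0 * math.atanh(norm)  # distance from the origin at curvature -1
    direction = [a / max(norm, eps) for a in x]
    return radius, direction

def pair_terms(x, y):
    """Return (radial gap, angular gap) for two points in the ball."""
    rx, ux = radial_angular(x)
    ry, uy = radial_angular(y)
    radial_gap = (rx - ry) ** 2
    cosine = sum(a * b for a, b in zip(ux, uy))
    angular_gap = 1.0 - cosine
    return radial_gap, angular_gap

def pair_loss(x, y, same_class, w_radial=1.0, w_angular=1.0, margin=1.0):
    """Illustrative pair loss: pull positives together, push negatives apart in angle."""
    radial_gap, angular_gap = pair_terms(x, y)
    if same_class:
        return w_radial * radial_gap + w_angular * angular_gap
    # Hinge on the angular term only; this particular choice is an assumption.
    return w_angular * max(0.0, margin - angular_gap)

# Same radius, orthogonal directions: only the angular term is non-zero.
same_radius = pair_terms([0.5, 0.0], [0.0, 0.5])
# Same direction, different radii: only the radial term is non-zero.
same_direction = pair_terms([0.3, 0.0], [0.6, 0.0])
```

Keeping the two terms separate lets their weights be tuned independently, for example to let depth in the hierarchy (radius) and class identity (angle) be penalised at different strengths.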

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

EC-Net combines Poincare embeddings with bidirectional hypergraph fusion and decoupled hyperbolic contrastive learning to target robustness in multimodal emotion recognition when modalities are missing or noisy.

The main takeaway is that this paper introduces EC-Net, a framework that embeds modality hierarchies in Poincare balls, fuses them via bidirectional hypergraph message passing, and sharpens class separation with contrastive objectives split into radial and angular parts in hyperbolic space. The abstract claims this preserves high-order relations across time and modalities and delivers better accuracy on standard benchmarks, especially under partial or noisy inputs. That combination is not a routine extension of prior work, so the specific architecture counts as the new piece.

The paper does a reasonable job of framing a practical goal: making affect models more resilient in real-world settings where data is incomplete. The geometric choice aligns with the hierarchical nature of emotional signals, and the hypergraph step aims to capture relations that simple graphs might miss. The stress-test note is right that the described components form a coherent pipeline without obvious internal contradictions.

The soft spots are straightforward. We only have the abstract, so there are no derivations, ablation tables, or error-bar details to inspect. Claims of consistent gains rest on empirical results that cannot be evaluated yet, and the title's references to dual mirror manifolds and anti-emotion reflection are not explained here. If the full paper supplies those controls and shows the improvements are not just from stronger baselines, the contribution holds; otherwise the gains could be modest.

This is for readers already working on geometric deep learning or multimodal affect systems who want to test whether hyperbolic methods add robustness. It deserves peer review because the problem is relevant and the architecture is clearly motivated, even if the evidence needs checking.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes EC-Net, a hyperbolic hypergraph framework for multimodal emotion and sentiment modeling. Modality hierarchies are represented via Poincaré-ball embeddings; fusion occurs through bidirectional hypergraph message passing between nodes and hyperedges; contrastive learning is performed in hyperbolic space using decoupled radial and angular objectives; and adaptive hyperedge construction is used to preserve high-order semantic relations across time steps and modalities. The central empirical claim is that the resulting representations are robust and yield consistent accuracy gains on standard multimodal emotion benchmarks, especially under partial modality availability or noise.

Significance. If the claimed gains are reproducible and the architecture is shown to outperform strong Euclidean and graph baselines with proper controls, the work would demonstrate a concrete benefit of combining explicit hyperbolic hierarchy with hypergraph fusion for resilient multimodal affect modeling. This could inform future HCI systems that must operate with incomplete or noisy sensor streams.

major comments (1)

Abstract: the central claim of consistent accuracy improvement 'particularly when modalities are partially available or contaminated by noise' is stated without any quantitative numbers, baseline names, or statistical significance tests. Because the soundness of the empirical support is load-bearing for the paper's contribution, the absence of even summary results prevents verification of whether the hyperbolic-hypergraph combination actually delivers the stated resilience.

minor comments (1)

The title refers to 'Dual Hyperbolic Mirror Manifolds' and 'Anti Emotion Reflection' while the abstract describes EC-Net with Poincaré-ball embeddings and bidirectional hypergraph passing; the manuscript should clarify whether these are the same architecture or whether the title describes a distinct component.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the major comment below and will revise the manuscript to improve the verifiability of our empirical claims.

Referee: Abstract: the central claim of consistent accuracy improvement 'particularly when modalities are partially available or contaminated by noise' is stated without any quantitative numbers, baseline names, or statistical significance tests. Because the soundness of the empirical support is load-bearing for the paper's contribution, the absence of even summary results prevents verification of whether the hyperbolic-hypergraph combination actually delivers the stated resilience.

Authors: We agree that the abstract would be strengthened by including quantitative support for the central claim. In the revised manuscript, we will update the abstract to report specific accuracy improvements (e.g., gains on IEMOCAP and CMU-MOSEI under partial/noisy modality conditions), name the primary baselines, and reference statistical significance where available in the results. This change will make the empirical contribution immediately verifiable without altering the paper's technical content.

revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The provided abstract and architecture description introduce Poincare-ball embeddings for modality hierarchies, bidirectional hypergraph message passing for fusion, and hyperbolic contrastive learning with decoupled objectives. No equations, fitted parameters, or self-citations are shown that reduce any claimed prediction or result to the inputs by construction. The central claim of improved resilient multimodal affect understanding is tied to empirical results on standard benchmarks, which constitute external validation rather than internal circular reduction. The derivation remains self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated.

pith-pipeline@v0.9.0 · 5448 in / 1018 out tokens · 17471 ms · 2026-05-15T21:36:28.792470+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (J-cost uniqueness) · tag: unclear
Paper passage: "EC-Net represents modality hierarchies using Poincaré-ball embeddings and performs fusion through a hypergraph mechanism... contrastive learning is formulated in hyperbolic space with decoupled radial and angular objectives."

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration · tag: unclear
Paper passage: "Differentiable Mirror Layer... learnable involution (gϕ, fψ) with cycle loss and Riemannian importance re-weighting"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
