pith. machine review for the scientific record.

arxiv: 2604.12518 · v2 · submitted 2026-04-14 · 💻 cs.CL

Recognition: unknown

Enhance-then-Balance Modality Collaboration for Robust Multimodal Sentiment Analysis

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 14:45 UTC · model grok-4.3

classification 💻 cs.CL
keywords multimodal sentiment analysis · modality collaboration · missing modalities · cross-modal enhancement · sentiment analysis · multimodal fusion · robustness

The pith

A two-stage enhance-then-balance process strengthens weaker signals and prevents dominant modalities from overshadowing them in multimodal sentiment analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to fix the common issue in multimodal sentiment analysis where text often dominates audio and visual inputs, causing incomplete fusion and fragility when some signals are missing or noisy. It introduces an Enhance-then-Balance Modality Collaboration framework that first disentangles semantics and applies cross-modal enhancement to raise the quality of weaker modality representations. It then applies energy-guided coordination to implicitly rebalance gradients and trust distillation to adjust fusion weights according to per-sample reliability. If the approach holds, models would integrate heterogeneous signals more evenly and retain performance under incomplete inputs. Readers would care because practical emotion inference from video, speech, and text routinely encounters missing or low-quality channels.

Core claim

The EBMC model improves representation quality via semantic disentanglement and cross-modal enhancement to strengthen weaker modalities. It then employs an Energy-guided Modality Coordination mechanism, which achieves implicit gradient rebalancing through a differentiable equilibrium objective, together with Instance-aware Modality Trust Distillation, which estimates sample-level reliability to adaptively modulate fusion weights. The combination is claimed to produce state-of-the-art or competitive accuracy and strong results under missing-modality conditions.
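To make the staged design concrete, the following is a minimal PyTorch sketch of what an enhance-then-balance forward pass could look like: per-modality projections, cross-modal attention that lets the audio and visual streams borrow from text, and a per-sample reliability score that softly weights the fusion. Every module name, feature size, and the form of the reliability score here is an assumption for illustration; the paper's actual equilibrium and distillation objectives are not reproduced.

```python
# Minimal sketch of an enhance-then-balance forward pass. Module names, feature
# sizes, and the reliability score are assumptions for illustration; this is
# not the paper's EBMC implementation, and the equilibrium and distillation
# objectives are omitted.
import torch
import torch.nn as nn


class EnhanceThenBalanceSketch(nn.Module):
    def __init__(self, d_text=768, d_audio=74, d_visual=47, d_model=128):
        super().__init__()
        # Project each modality into a shared width.
        self.proj = nn.ModuleDict({
            "t": nn.Linear(d_text, d_model),
            "a": nn.Linear(d_audio, d_model),
            "v": nn.Linear(d_visual, d_model),
        })
        # "Enhance": let the weaker streams attend to the text stream.
        self.enhance = nn.ModuleDict({
            "a": nn.MultiheadAttention(d_model, 4, batch_first=True),
            "v": nn.MultiheadAttention(d_model, 4, batch_first=True),
        })
        # "Balance": a per-sample, per-modality reliability score that
        # modulates fusion weights (a stand-in for trust distillation).
        self.trust = nn.ModuleDict({m: nn.Linear(d_model, 1) for m in "tav"})
        self.head = nn.Linear(d_model, 1)  # sentiment regression head

    def forward(self, x_t, x_a, x_v):
        # Each input: (batch, seq_len, feat_dim) sequence features.
        h = {m: self.proj[m](x) for m, x in zip("tav", (x_t, x_a, x_v))}
        for m in ("a", "v"):  # audio/visual queries attend over text keys/values
            h[m], _ = self.enhance[m](h[m], h["t"], h["t"])
        pooled = {m: seq.mean(dim=1) for m, seq in h.items()}
        scores = torch.cat([self.trust[m](pooled[m]) for m in "tav"], dim=-1)
        weights = torch.softmax(scores, dim=-1)  # (batch, 3) fusion weights
        fused = sum(weights[:, i:i + 1] * pooled[m] for i, m in enumerate("tav"))
        return self.head(fused), weights


if __name__ == "__main__":
    model = EnhanceThenBalanceSketch()
    y, w = model(torch.randn(2, 20, 768), torch.randn(2, 20, 74),
                 torch.randn(2, 20, 47))
    print(y.shape, w.shape)  # torch.Size([2, 1]) torch.Size([2, 3])
```

The ordering is the point of the sketch: representations are improved first, and only then are their contributions balanced at fusion time.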

What carries the argument

The Enhance-then-Balance Modality Collaboration framework, which first lifts weaker-modality representations through disentanglement and cross-modal enhancement, then coordinates contributions via energy-based equilibrium objectives and reliability-weighted distillation to reduce modality competition.

If this is right

  • Multimodal fusion can achieve higher overall accuracy by reducing the overshadowing effect of stronger modalities on weaker ones.
  • Models remain effective for emotion inference even when audio, visual, or text channels are absent or corrupted.
  • Implicit gradient rebalancing removes the need for manual modality-specific hyperparameters during training.
  • Adaptive trust weighting improves sample-level reliability estimation across varied data distributions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same strengthen-then-rebalance pattern could extend to other multimodal tasks such as visual question answering where one modality frequently dominates.
  • Testing the framework on real-world social-media data with naturally occurring missing channels would reveal its practical limits beyond controlled benchmarks.
  • The trust-distillation component might transfer to settings like federated multimodal learning to handle device-specific signal quality differences.

Load-bearing premise

The premise that semantic disentanglement plus cross-modal enhancement will reliably improve weaker modalities and that energy-guided coordination plus trust distillation will rebalance contributions without introducing new overfitting or bias on real data.
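One common way to realize semantic disentanglement (not necessarily the paper's construction, which the abstract does not specify) is to project each modality into a shared, modality-invariant code and a private, modality-specific one, with a penalty that keeps the two from duplicating content. A minimal sketch under that assumption:

```python
# Illustrative disentanglement module: shared vs private projections plus a
# soft-orthogonality penalty. Names, sizes, and the penalty form are assumed
# for the sketch, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DisentangleSketch(nn.Module):
    def __init__(self, d_in=128, d_out=64):
        super().__init__()
        self.shared = nn.Linear(d_in, d_out)   # modality-invariant semantics
        self.private = nn.Linear(d_in, d_out)  # modality-specific residue

    def forward(self, h):
        s, p = self.shared(h), self.private(h)
        # Soft orthogonality: discourage the two codes from duplicating content.
        overlap = (F.normalize(s, dim=-1) * F.normalize(p, dim=-1)).sum(-1).pow(2).mean()
        return s, p, overlap


if __name__ == "__main__":
    enc = DisentangleSketch()
    s, p, ortho = enc(torch.randn(8, 128))
    print(s.shape, p.shape, float(ortho))  # torch.Size([8, 64]) torch.Size([8, 64]) ...
```

The load-bearing assumption is that the shared code carries enough sentiment signal for the enhancement stage to lift the weaker modalities.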

What would settle it

On standard multimodal sentiment benchmarks such as CMU-MOSI or CMU-MOSEI, randomly remove one modality from test samples and check whether EBMC maintains higher accuracy or F1 than strong baselines; a substantial drop would falsify the robustness claim.
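The protocol is mechanical enough to script. A hedged sketch, assuming the model takes (text, audio, visual) tensors, that a missing channel is represented by zeroed features (one common convention; the paper may impute differently), and that binary accuracy comes from thresholding the continuous sentiment score at zero:

```python
# Missing-modality stress test: drop one random channel per test sample and
# compare binary accuracy / F1 against the complete-input run. Zero-imputation
# and sign-thresholded labels are assumptions, not the paper's protocol.
import numpy as np
import torch
from sklearn.metrics import accuracy_score, f1_score


@torch.no_grad()
def evaluate(model, loader, missing_rate=0.0, seed=0):
    rng = np.random.default_rng(seed)
    preds, labels = [], []
    model.eval()
    for x_t, x_a, x_v, y in loader:  # batches of (text, audio, visual, label)
        x_t, x_a, x_v = x_t.clone(), x_a.clone(), x_v.clone()
        for i in range(y.shape[0]):
            if rng.random() < missing_rate:
                dropped = int(rng.integers(3))      # 0=text, 1=audio, 2=visual
                (x_t, x_a, x_v)[dropped][i] = 0.0   # zero-impute that channel
        out = model(x_t, x_a, x_v)
        score = out[0] if isinstance(out, tuple) else out
        preds.append((score.squeeze(-1) > 0).long().cpu())
        labels.append((y > 0).long().cpu())
    preds = torch.cat(preds).numpy()
    labels = torch.cat(labels).numpy()
    return accuracy_score(labels, preds), f1_score(labels, preds)


# Usage (hypothetical model and loader): compare the degradation of EBMC to a
# strong baseline; if EBMC falls as much or more, the robustness claim fails.
# acc_full, f1_full = evaluate(ebmc, test_loader, missing_rate=0.0)
# acc_miss, f1_miss = evaluate(ebmc, test_loader, missing_rate=1.0)
```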

Figures

Figures reproduced from arXiv: 2604.12518 by Chong Teng, Donghong Ji, Fei Li, Kang He, Xinrong Wang, Yuzhe Ding.

Figure 1: Illustration of modality imbalance: text tends to domi… [image omitted]
Figure 2: The overall architecture of our proposed model EBMC. [image omitted]
Figure 3: t-SNE visualization of feature distributions on CMU-MOSI. The closer the color is to red, the more positive the sentiment. EBMC… [image omitted]
Figure 4: The impact of EMC on modality contributions and over… [image omitted]
Figure 5: Comparison of EBMC and baseline predictions on the… [image omitted]
Figure 6: Hyperparameter sensitivity analysis of EBMC on the CMU-MOSEI dataset. The effects of varying… [image omitted]
Figure 7: Performance curves on the CMU-MOSI dataset under increasing modality missing rates. The four subplots respectively report… [image omitted]
Figure 8: Performance curves on the CMU-MOSEI dataset under increasing modality missing rates. The four subplots respectively report… [image omitted]
read the original abstract

Multimodal sentiment analysis (MSA) integrates heterogeneous text, audio, and visual signals to infer human emotions. While recent approaches leverage cross-modal complementarity, they often struggle to fully utilize weaker modalities. In practice, dominant modalities tend to overshadow non-verbal ones, inducing modality competition and limiting overall contributions. This imbalance degrades fusion performance and robustness under noisy or missing modalities. To address this, we propose a novel model, Enhance-then-Balance Modality Collaboration framework (EBMC). EBMC improves representation quality via semantic disentanglement and cross-modal enhancement, strengthening weaker modalities. To prevent dominant modalities from overwhelming others, an Energy-guided Modality Coordination mechanism achieves implicit gradient rebalancing via a differentiable equilibrium objective. Furthermore, Instance-aware Modality Trust Distillation estimates sample-level reliability to adaptively modulate fusion weights, ensuring robustness. Extensive experiments demonstrate that EBMC achieves state-of-the-art or competitive results and maintains strong performance under missing-modality settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes the Enhance-then-Balance Modality Collaboration (EBMC) framework for multimodal sentiment analysis. It first strengthens weaker modalities through semantic disentanglement and cross-modal enhancement, then applies an Energy-guided Modality Coordination mechanism that uses a differentiable equilibrium objective for implicit gradient rebalancing to prevent dominant modalities from overshadowing others. An Instance-aware Modality Trust Distillation component estimates sample-level reliability to adaptively modulate fusion weights. The authors claim that extensive experiments show EBMC achieves state-of-the-art or competitive results while maintaining strong performance under missing-modality conditions.

Significance. If the central mechanisms are shown to deliver the claimed rebalancing and robustness, the work could meaningfully advance multimodal sentiment analysis by providing a practical approach to modality competition and imbalance, a persistent issue in real-world deployments with noisy or incomplete inputs. The staged enhance-then-balance design and the use of an equilibrium objective represent a potentially useful direction for implicit coordination without explicit per-modality weights.

major comments (2)
  1. Section 3.3: The Energy-guided Modality Coordination is presented as the key mechanism for implicit gradient rebalancing via a differentiable equilibrium objective. Yet Tables 2–4 report only downstream accuracy/F1 and missing-modality robustness; there are no ablations that isolate the equilibrium term, no measurements or statistics of per-modality gradient norms before/after the mechanism, and no verification that the fixed point equalizes influence rather than acting as generic regularization. This leaves the central 'balance' claim without direct empirical support (a sketch of such a gradient-norm probe appears below, after the minor comments).
  2. Experimental sections (Tables 2–4 and associated text): The manuscript asserts SOTA or competitive results and robustness, but provides no details on the number of random seeds, statistical significance tests, error bars, or component-wise ablations (disentanglement, enhancement, coordination, distillation). Without these, it is impossible to attribute performance gains specifically to the proposed rebalancing and trust mechanisms rather than to the enhancement stage alone.
minor comments (1)
  1. Abstract: The abstract refers to 'extensive experiments' without naming the datasets, modalities, or primary evaluation metrics; adding one sentence with this information would improve immediate readability.
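The gradient-norm evidence requested in major comment 1 is cheap to collect. A minimal probe, assuming the per-modality encoders are reachable as named submodules and that the coordination term can be added to or removed from the loss; the names and call signature below are illustrative, not the paper's API:

```python
# Probe for the requested evidence: per-modality gradient norms flowing into
# each encoder, with the coordination term toggled on or off. Encoder names,
# the model call signature, and `equilibrium_term` are illustrative assumptions.
import torch


def modality_grad_norms(model, encoders, batch, loss_fn, equilibrium_term=None):
    """encoders: e.g. {"text": model.text_enc, "audio": model.audio_enc, ...}."""
    model.zero_grad(set_to_none=True)
    x_t, x_a, x_v, y = batch
    out = model(x_t, x_a, x_v)
    pred = out[0] if isinstance(out, tuple) else out
    loss = loss_fn(pred.squeeze(-1), y)
    if equilibrium_term is not None:  # optionally add the coordination objective
        loss = loss + equilibrium_term(model)
    loss.backward()
    norms = {}
    for name, enc in encoders.items():
        grads = [p.grad.norm() for p in enc.parameters() if p.grad is not None]
        norms[name] = torch.stack(grads).norm().item() if grads else 0.0
    return norms

# Logging these per-modality norms over training, with and without the
# equilibrium term, would show directly whether the fixed point equalizes
# modality influence or merely acts as generic regularization.
```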

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and outline revisions that will strengthen the empirical support for our claims.

read point-by-point responses
  1. Referee: Section 3.3: The Energy-guided Modality Coordination is presented as the key mechanism for implicit gradient rebalancing via a differentiable equilibrium objective. Yet Tables 2–4 report only downstream accuracy/F1 and missing-modality robustness; there are no ablations that isolate the equilibrium term, no measurements or statistics of per-modality gradient norms before/after the mechanism, and no verification that the fixed point equalizes influence rather than acting as generic regularization. This leaves the central 'balance' claim without direct empirical support.

    Authors: We acknowledge that direct measurements such as per-modality gradient norm statistics and an ablation isolating the equilibrium objective would provide stronger verification of the rebalancing effect. The reported gains in accuracy, F1, and missing-modality robustness are consistent with the intended coordination, but we agree these do not constitute isolated evidence. In the revision we will add (i) an ablation removing only the equilibrium term and (ii) before/after gradient-norm statistics across modalities to confirm the fixed point equalizes influence rather than acting as generic regularization. revision: yes

  2. Referee: Experimental sections (Tables 2–4 and associated text): The manuscript asserts SOTA or competitive results and robustness, but provides no details on the number of random seeds, statistical significance tests, error bars, or component-wise ablations (disentanglement, enhancement, coordination, distillation). Without these, it is impossible to attribute performance gains specifically to the proposed rebalancing and trust mechanisms rather than to the enhancement stage alone.

    Authors: We agree that the current experimental reporting lacks the granularity needed to attribute gains precisely. The manuscript will be revised to (i) state that all results are averaged over 5 random seeds with standard deviations shown as error bars, (ii) include statistical significance tests (paired t-tests with p-values), and (iii) expand the ablation table to evaluate each component in isolation (semantic disentanglement, cross-modal enhancement, energy-guided coordination, and instance-aware trust distillation). These additions will allow clearer separation of the contribution of the balance and trust stages from the enhancement stage; a minimal sketch of this reporting appears below. revision: yes
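A minimal sketch of the reporting promised in response 2, assuming five seeds and identical data splits for both models so the per-seed scores can be paired; the arrays are synthetic stand-ins, not results from the paper:

```python
# Reporting sketch: mean ± std over seeds plus a paired t-test between EBMC
# and the strongest baseline. The arrays below are synthetic placeholders;
# replace them with real per-seed accuracies obtained on identical splits.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
ebmc_acc = rng.normal(0.0, 1.0, size=5)       # placeholder per-seed scores
baseline_acc = rng.normal(0.0, 1.0, size=5)   # placeholder per-seed scores

print(f"EBMC:     {ebmc_acc.mean():.4f} ± {ebmc_acc.std(ddof=1):.4f}")
print(f"baseline: {baseline_acc.mean():.4f} ± {baseline_acc.std(ddof=1):.4f}")

# Same seeds and splits for both models, so the observations are paired.
t_stat, p_value = stats.ttest_rel(ebmc_acc, baseline_acc)
print(f"paired t = {t_stat:.3f}, p = {p_value:.4f}")
```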

Circularity Check

0 steps flagged

No significant circularity; mechanisms proposed and validated empirically

full rationale

The paper introduces the EBMC framework consisting of semantic disentanglement plus cross-modal enhancement for strengthening weaker modalities, followed by an Energy-guided Modality Coordination mechanism defined via a differentiable equilibrium objective for implicit rebalancing, and Instance-aware Modality Trust Distillation for adaptive weighting. These components are presented as architectural choices whose effectiveness is assessed through downstream experiments on accuracy, F1, and missing-modality robustness rather than through any derivation that reduces the claimed outcomes to fitted inputs or self-referential definitions. No equations are shown that equate a prediction to its own construction, no load-bearing self-citations reduce the central premise to prior unverified work by the same authors, and the abstract and described structure treat the balance and enhancement steps as independent proposals supported by external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The abstract relies on standard domain assumptions in multimodal learning but introduces no new free parameters, invented entities, or ad-hoc axioms beyond the implicit premise that modalities contain complementary semantic information that can be disentangled and rebalanced.

axioms (1)
  • domain assumption Heterogeneous text, audio, and visual signals contain complementary semantic information that can be disentangled and cross-enhanced to strengthen weaker modalities.
    This premise underpins the 'Enhance' stage of the proposed framework.

pith-pipeline@v0.9.0 · 5467 in / 1323 out tokens · 45013 ms · 2026-05-10T14:45:58.883693+00:00 · methodology

