pith. machine review for the scientific record. sign in

arxiv: 2604.12735 · v1 · submitted 2026-04-14 · 💻 cs.CV

Recognition: unknown

AffectAgent: Collaborative Multi-Agent Reasoning for Retrieval-Augmented Multimodal Emotion Recognition

Bo Zhao, Fei Ma, Haochen Liang, Hui Ma, Jiayu Zhang, Qi Tian, Taorui Wang, Wei Xia, Yijie Zhu, Zeheng Wang, Zhishu Liu, Zitong Yu

Authors on Pith no claims yet

Pith reviewed 2026-05-10 15:17 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal emotion recognitionmulti-agent systemsretrieval-augmented generationcollaborative reasoningaffective computingMAPPOmodality balancing
0
0 comments X

The pith

AffectAgent uses three collaborating agents to retrieve and reason about multimodal emotions more accurately than single-round methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that static LLM memory and single-round retrieval-augmented generation fail on nuanced emotions because of modal ambiguity and cross-modal dependencies. It introduces AffectAgent, a framework in which a query planner, an evidence filter, and an emotion generator jointly retrieve cross-modal samples, evaluate evidence, and produce predictions. These agents are trained end-to-end with MAPPO under a shared affective reward, while MB-MoE balances modality contributions and RAAF adds retrieved embeddings when data is missing. Experiments on MER-UniBench show gains in complex cases, implying that task decomposition plus joint optimization can reduce hallucinations in affective understanding.

Core claim

AffectAgent is an affect-oriented multi-agent retrieval-augmented generation framework comprising a query planner, an evidence filter, and an emotion generator that collaboratively retrieve cross-modal samples, assess evidence, and generate predictions; the agents are optimized end-to-end with MAPPO using a shared affective reward, augmented by MB-MoE to regulate modality contributions and RAAF to complete semantics under missing modalities.

What carries the argument

The three-agent collaborative structure (query planner, evidence filter, emotion generator) jointly optimized via MAPPO with a shared affective reward, which decomposes retrieval-augmented reasoning into planning, filtering, and generation steps for consistent cross-modal affective understanding.

If this is right

  • AffectAgent achieves superior performance across complex scenarios on MER-UniBench.
  • MB-MoE dynamically balances modality contributions to reduce representation mismatch from cross-modal heterogeneity.
  • RAAF improves semantic completion under missing-modality conditions by incorporating retrieved audiovisual embeddings.
  • End-to-end MAPPO optimization with the shared reward produces more consistent emotion predictions than independent agent training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same agent decomposition could be tested on other ambiguous multimodal tasks such as medical image-text diagnosis.
  • Shared-reward reinforcement learning may generalize as a way to align specialized agents in retrieval-augmented pipelines beyond emotion recognition.
  • Scaling experiments on larger or real-time multimodal streams would show whether the added coordination cost remains acceptable.

Load-bearing premise

Joint optimization via MAPPO with a shared affective reward will align the three agents to produce consistent cross-modal understanding without coordination failures or reward exploitation that hides underlying modal ambiguity.

What would settle it

If the system shows no accuracy gain over single-round RAG baselines or exhibits persistent disagreements among the three agents when tested on a dataset of known cross-modal ambiguities, the claim of reliable collaborative affective understanding would not hold.

Figures

Figures reproduced from arXiv: 2604.12735 by Bo Zhao, Fei Ma, Haochen Liang, Hui Ma, Jiayu Zhang, Qi Tian, Taorui Wang, Wei Xia, Yijie Zhu, Zeheng Wang, Zhishu Liu, Zitong Yu.

Figure 1
Figure 1. Figure 1: (a) Vanilla LLMs [6, 24] rely on static memory, struggling with nuanced emotion understanding. (b) Traditional RAG-enhanced models [32, 44] follow a linear retrieve-then-generate pipeline, which is inade￾quate for complex emotional reasoning. (c) Our Affec￾tAgent adopts a collaborative multi-agent architecture that orchestrates query planning, retrieval, filtering, and generation for emotion recognition. i… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of AffectAgent. AffectAgent performs multimodal emotion recognition through collaborative [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Illustration of three representative case studies, showing smooth success, query deviation with a correct [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Visualization of RAAF and MB-MoE in Af￾fectAgent on sampled MER2023 data with AffectGPT as the backbone. (a) Performance gain (pp) under dif￾ferent retrieval conditions. (b) Average expert routing weights under different input states. (V), audio-led (A), and cross-modal conflict (Conf.), where E1–E4 denote the four experts. As shown in [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
read the original abstract

LLM-based multimodal emotion recognition relies on static parametric memory and often hallucinates when interpreting nuanced affective states. In this paper, given that single-round retrieval-augmented generation is highly susceptible to modal ambiguity and therefore struggles to capture complex affective dependencies across modalities, we introduce AffectAgent, an affect-oriented multi-agent retrieval-augmented generation framework that leverages collaborative decision-making among agents for fine-grained affective understanding. Specifically, AffectAgent comprises three jointly optimized specialized agents, namely a query planner, an evidence filter, and an emotion generator, which collaboratively perform analytical reasoning to retrieve cross-modal samples, assess evidence, and generate predictions. These agents are optimized end-to-end using Multi-Agent Proximal Policy Optimization (MAPPO) with a shared affective reward to ensure consistent emotion understanding. Furthermore, we introduce Modality-Balancing Mixture of Experts (MB-MoE) and Retrieval-Augmented Adaptive Fusion (RAAF), where MB-MoE dynamically regulates the contributions of different modalities to mitigate representation mismatch caused by cross-modal heterogeneity, while RAAF enhances semantic completion under missing-modality conditions by incorporating retrieved audiovisual embeddings. Extensive experiments on MER-UniBench demonstrate that AffectAgent achieves superior performance across complex scenarios. Our code will be released at: https://github.com/Wz1h1NG/AffectAgent.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes AffectAgent, a collaborative multi-agent RAG framework for multimodal emotion recognition to address modal ambiguity in LLM-based systems. It features three agents (query planner, evidence filter, emotion generator) optimized end-to-end via MAPPO with a shared affective reward. Additional modules MB-MoE and RAAF are introduced for modality balancing and adaptive fusion. The paper reports superior performance on the MER-UniBench dataset across complex scenarios.

Significance. If validated, the results would indicate that multi-agent collaborative reasoning can improve fine-grained affective understanding in retrieval-augmented settings. This has potential significance for affective computing and multi-agent RL applications. The planned code release is a strength that aids reproducibility.

major comments (3)
  1. [Abstract] Abstract: The claim of superior performance lacks any quantitative support, such as specific accuracy or F1 scores, baseline comparisons, or p-values, which is essential for substantiating the effectiveness of the proposed framework.
  2. [§3.3] §3.3: The MAPPO training with shared reward is described without addressing non-stationarity or providing evidence that it leads to stable collaboration and distinct roles among agents, which is load-bearing for the collaborative decision-making claim.
  3. [§4.2] §4.2: No ablation studies are presented that isolate the MAPPO multi-agent component from MB-MoE and RAAF, making it impossible to determine if the performance gains stem from the collaborative reasoning or other elements.
minor comments (1)
  1. [Abstract] Abstract: The description of RAAF could be clarified by specifying how retrieved audiovisual embeddings are incorporated.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify the presentation of our results and the analysis of the multi-agent framework. We address each major comment below and will revise the manuscript accordingly to strengthen the claims and supporting evidence.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The claim of superior performance lacks any quantitative support, such as specific accuracy or F1 scores, baseline comparisons, or p-values, which is essential for substantiating the effectiveness of the proposed framework.

    Authors: We agree that the abstract would be strengthened by including quantitative results. The full manuscript contains detailed tables with accuracy, F1 scores, and baseline comparisons on MER-UniBench, but these are not summarized in the abstract. In the revised version, we will update the abstract to report key metrics (e.g., overall accuracy and F1 improvements) and note the statistical significance where applicable. revision: yes

  2. Referee: [§3.3] §3.3: The MAPPO training with shared reward is described without addressing non-stationarity or providing evidence that it leads to stable collaboration and distinct roles among agents, which is load-bearing for the collaborative decision-making claim.

    Authors: We acknowledge that §3.3 focuses on the architectural setup and shared reward design but does not explicitly discuss non-stationarity or include empirical evidence of stable collaboration. In the revision, we will expand this section to explain how MAPPO's centralized value function and the shared affective reward mitigate non-stationarity, and we will add training curves or convergence metrics demonstrating stable policy updates and the differentiation of agent roles (query planner, evidence filter, emotion generator) over the course of training. revision: yes

  3. Referee: [§4.2] §4.2: No ablation studies are presented that isolate the MAPPO multi-agent component from MB-MoE and RAAF, making it impossible to determine if the performance gains stem from the collaborative reasoning or other elements.

    Authors: We agree that isolating the MAPPO multi-agent optimization from the MB-MoE and RAAF modules is necessary to attribute performance gains precisely. The current experiments evaluate the full system against external baselines but lack internal ablations that disable or replace the collaborative MAPPO component. In the revised manuscript, we will add ablation studies (e.g., single-agent MAPPO variants or non-collaborative optimization baselines) while keeping MB-MoE and RAAF fixed, to quantify the contribution of the multi-agent reasoning. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical multi-agent framework with no derivations or self-referential reductions

full rationale

The paper describes an empirical framework (AffectAgent with query planner, evidence filter, and emotion generator optimized via MAPPO and shared reward, plus MB-MoE and RAAF modules) evaluated on MER-UniBench. No mathematical equations, derivations, fitted parameters presented as predictions, or load-bearing self-citations appear in the provided text. Claims rest on experimental results rather than any chain that reduces to its own inputs by construction. This is a standard empirical ML contribution with independent experimental validation and no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; all such elements would require the full manuscript for identification.

pith-pipeline@v0.9.0 · 5564 in / 1133 out tokens · 45215 ms · 2026-05-10T15:17:48.982601+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

73 extracted references · 22 canonical work pages · 5 internal anchors

  1. [1]

    Mohammad Mahdi Abootorabi, Amirhosein Zobeiri, Mahdi Dehghani, Mohammadali Mohammadkhani, Bardia Mohammadi, Omid Ghahroodi, and Mahdieh Baghshah. 2025. Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation. InFindings of the Association for Computational Linguistics: ACL 2025

  2. [2]

    Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Ha- jishirzi. 2024. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. InProceedings of the 12th International Confer- ence on Learning Representations (ICLR)

  3. [3]

    Julia Belikova et al. 2024. DeepPavlov at SemEval-2024 Task 3: Multi- modal Large Language Models in Emotion Reasoning. InProceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

  4. [4]

    Wenhu Chen, Hexiang Hu, Chitwan Saharia, and William W. Co- hen. 2022. MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP). 5558–5570

  5. [5]

    Yiqun Chen, Lingyong Yan, Weiwei Sun, Xinyu Ma, Yi Zhang, Shuaiqiang Wang, Dawei Yin, Yiming Yang, and Jiaxin Mao. 2025. MMoA: Improving Retrieval-Augmented Generation through Multi- Agent Reinforcement Learning. InAdvances in Neural Information Pro- cessing Systems (NeurIPS)

  6. [6]

    Hauptmann

    Zebang Cheng, Zhi-Qi Cheng, Jun-Yan He, Kai Wang, Yuxiang Lin, Zheng Lian, Xiaojiang Peng, and Alexander G. Hauptmann. 2024. Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning. InAdvances in Neural Information Processing Sys- tems

  7. [7]

    Zesen Cheng, Sicong Li, Jiaming Chen, Fuzhao Chen, Hang Zhang, Yuzi Dong, Mude Sun, Xin Li, Mingyu Gao, Jianwei Liu, et al . 2024. VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Un- derstanding in Video-LLMs.arXiv preprint arXiv:2406.07476(2024)

  8. [8]

    Zebang Cheng, Yuxiang Lin, Zhiqi Chen, Xiang Li, Shuyi Mao, Fan Zhang, Dongdong Ding, Bowen Zhang, and Xiaojiang Peng. 2023. Semi- supervised multimodal emotion recognition with expression MAE. In Proceedings of the 31st ACM International Conference on Multimedia. 9436–9440

  9. [9]

    Roddy Cowie, Ellen Douglas-Cowie, Nicolas Tsapatsoulis, George Votsis, Stefanos Kollias, Winfried Fellenz, and John G Taylor. 2001. Emotion recognition in human-computer interaction.IEEE Signal Processing Magazine18, 1 (2001), 32–80

  10. [10]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova

  11. [11]

    InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Hu- man Language Technologies

    BERT: Pre-training of deep bidirectional transformers for lan- guage understanding. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Hu- man Language Technologies. 4171–4186

  12. [12]

    Moataz El Ayadi, Mohamed S Kamel, and Fakhri Karray. 2011. Survey on speech emotion recognition: Features, classification schemes, and databases.Pattern Recognition44, 3 (2011), 572–587

  13. [13]

    Yun Gao, Yun Xiong, Meng Wang, and Haofen Wang. 2024. Modu- lar RAG: Transforming RAG Systems into Lego-like Reconfigurable Frameworks.arXiv preprint arXiv:2407.21059(2024)

  14. [14]

    Wei Han, Hui Chen, and Soujanya Poria. 2021. Improving multimodal fusion with hierarchical mutual information maximization for mul- timodal sentiment analysis. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 9180–9192

  15. [15]

    Ross, Cordelia Schmid, and Karteek Alahari

    Ziniu Hu, Ahmet Iscen, Chen Sun, Kai-Wei Chang, Yizhou Sun, David A. Ross, Cordelia Schmid, and Karteek Alahari. 2023. REVEAL: Retrieval- Augmented Visual-Language Pre-Training with Multi-Source Multi- modal Knowledge Memory. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 23369–23379

  16. [16]

    Ao Jia, Yu He, Yazhou Zhang, Sagar Uprety, Dawei Song, and Christina Lioma. 2022. Beyond emotion: A multi-modal dataset for human desire understanding. InProceedings of the 2022 Conference of the North Amer- ican Chapter of the Association for Computational Linguistics: Human Language Technologies. 1512–1522

  17. [17]

    Xu, Luyu Gao, Zhiqing Sun, Qiao Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig

    Zhengbao Jiang, Frank F. Xu, Luyu Gao, Zhiqing Sun, Qiao Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. Active Retrieval Augmented Generation. InProceedings of the 2023 Con- ference on Empirical Methods in Natural Language Processing (EMNLP). 7969–7992

  18. [18]

    Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs.IEEE Transactions on Big Data7, 3 (2019), 535–547

  19. [19]

    Zixuan Ke, Weize Kong, Cheng Li, Mingyang Zhang, Qiaozhu Mei, and Michael Bendersky. 2024. Bridging the Preference Gap between Retrievers and LLMs.arXiv preprint arXiv:2401.06954(2024)

  20. [20]

    Zhenglun Kong, Yize Li, Fanhu Zeng, Lei Xin, Shvat Messica, Xue Lin, Pu Zhao, Manolis Kellis, Hao Tang, and Marinka Zitnik. 2026. Token Reduction Should Go Beyond Efficiency in Generative Models – From Vision, Language to Multimodality. arXiv:2505.18227 [cs.LG] https://arxiv.org/abs/2505.18227

  21. [21]

    Shanglin Lei, Guanting Dong, Xiaoping Wang, Keheng Wang, and Sirui Wang. 2023. InstructERC: Reforming emotion recognition in conver- sation with a retrieval multi-task LLMs framework.arXiv preprint arXiv:2309.11911(2023)

  22. [22]

    Kunchang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. 2025. VideoChat: Chat-centric video understanding.Science China Information Sciences68, 10 (2025), 200102

  23. [23]

    Shan Li and Weihong Deng. 2022. Deep facial expression recognition: A survey.IEEE Transactions on Affective Computing13, 3 (2022), 1195– 1215

  24. [24]

    Xinze Li, Sen Mei, Zhenghao Liu, Yukun Yan, Shuo Wang, Shi Yu, Zheni Zeng, Hao Chen, Ge Yu, Zhiyuan Liu, Maosong Sun, and Chenyan Xiong

  25. [25]

    InProceedings of the 13th International Conference on Learning Representations (ICLR)

    RAG-DDR: Optimizing Retrieval-Augmented Generation Using Differentiable Data Rewards. InProceedings of the 13th International Conference on Learning Representations (ICLR)

  26. [26]

    Zheng Lian, Haoyu Chen, Lan Chen, Haiyang Sun, Licai Sun, Yong Ren, Zebang Cheng, Bin Liu, Rui Liu, Xiaojiang Peng, Jiangyan Yi, and Jianhua Tao. 2025. AffectGPT: A New Dataset, Model, and Benchmark for Emotion Understanding with Multimodal Large Language Models. InProceedings of the 42nd International Conference on Machine Learning

  27. [27]

    Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Ru, Zhaopeng Tu, Shuming Shi, and Xing Xu. 2024. Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP)

  28. [28]

    Zhun Liu, Ying Shen, Varun Bharadhwaj Lakshminarasimhan, Paul Pu Liang, Amir Zadeh, and Louis-Philippe Morency. 2018. Efficient low- rank multimodal fusion with modality-specific factors. InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics. 2247–2256

  29. [29]

    Zhishu Liu, Kaishen Yuan, Bo Zhao, Hui Ma, and Zitong Yu. 2026. AULLM++: Structural Reasoning with Large Language Models for Micro-Expression Recognition. arXiv:2603.08387 [cs.CV] https://arxiv. org/abs/2603.08387

  30. [30]

    Zhishu Liu, Kaishen Yuan, Bo Zhao, Yong Xu, and Zitong Yu. 2025. Au-llm: Micro-expression action unit detection via enhanced llm-based feature fusion. InChinese Conference on Biometric Recognition. Springer, 355–365

  31. [31]

    Xinxi Ma, Yeyun Gong, Pengcheng He, Hai Zhao, and Nan Duan. 2023. Query rewriting for retrieval-augmented large language models.arXiv Conference acronym ’XX, June 03–05, 2018, Woodstock, NY Trovato et al. preprint arXiv:2305.14283(2023)

  32. [32]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wain- wright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Chris- tiano, Jan Leike, and Ryan Lowe. 2022. Training Language Models to Follow Instructions with...

  33. [33]

    Rosalind W Picard, Elias Vyzas, and Jennifer Healey. 2001. Toward machine emotional intelligence: Analysis of affective physiological state.IEEE Transactions on Pattern Analysis and Machine Intelligence23, 10 (2001), 1175–1191

  34. [34]

    Vittorio Pipoli, Alessia Saporita, Federico Bolelli, Marcella Cornia, Lorenzo Baraldi, Costantino Grana, Rita Cucchiara, and Elisa Ficarra

  35. [35]

    InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision (ICCV)

    MISSRAG: Addressing the Missing Modality Challenge in Multi- modal Large Language Models. InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision (ICCV)

  36. [36]

    Soujanya Poria, Erik Cambria, Rajiv Bajpai, and Amir Hussain. 2017. A review of affective computing: From unimodal analysis to multimodal emotion recognition.Information Fusion37 (2017), 98–125

  37. [37]

    Wasifur Rahman, Md Kamrul Hasan, Sangwu Lee, AmirAli Bagher Zadeh, Chengfeng Mao, Louis-Philippe Morency, and Ehsan Hoque

  38. [38]

    InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics

    Integrating multimodal information in large pretrained trans- formers. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2359–2369

  39. [39]

    Björn W Schuller. 2018. Speech emotion recognition: Two decades in a nutshell, benchmarks, and ongoing trends.Commun. ACM61, 5 (2018), 90–99

  40. [40]

    High-Dimensional Continuous Control Using Generalized Advantage Estimation

    John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. 2016. High-Dimensional Continuous Control Using Generalized Advantage Estimation. InProc. Int. Conf. Learn. Represent. (ICLR). https://arxiv.org/abs/1506.02438

  41. [41]

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal Policy Optimization Algorithms (PPO).arXiv preprint arXiv:1707.06347(2017)

  42. [42]

    Mohammad Soleymani, David Garcia, Brendan Jou, Björn Schuller, Shih- Fu Chang, and Maja Pantic. 2017. A survey of multimodal emotion recognition.Image and Vision Computing65 (2017), 1–14

  43. [43]

    Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai

  44. [44]

    Pandagpt: One model to instruction-follow them all.arXiv preprint arXiv:2305.16355(2023)

  45. [45]

    Ma, Muqiao Yang, Ruslan Salakhutdi- nov, and Louis-Philippe Morency

    Yao-Hung Hubert Tsai, Martin Q. Ma, Muqiao Yang, Ruslan Salakhutdi- nov, and Louis-Philippe Morency. 2020. Multimodal routing: Improving local and global interpretability of multimodal language analysis. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing

  46. [46]

    Kai Wang, Xiaojiang Peng, Jianfei Yang, Debin Meng, and Yu Qiao

  47. [47]

    Region attention networks for pose and occlusion robust facial expression recognition.IEEE Transactions on Image Processing29 (2020), 4057–4069

  48. [48]

    Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Ma- jumder, and Furu Wei. 2024. Multilingual E5 Text Embeddings: A Technical Report.arXiv preprint arXiv:2402.05672(2024)

  49. [49]

    Tao Wang, X. Lin, Y. Xu, Q. Ye, D. Guo, Sergio Escalera, G. Khoriba, and Z. Yu. 2026. Micro-gesture Recognition: A Comprehensive Survey of Datasets, Methods, and Challenges.Machine Intelligence Research23, 2 (2026), 308–331. doi:10.1007/s11633-025-1629-x

  50. [50]

    Zhuofan Wen, Zheng Lian, Shun Chen, Hailiang Yao, Longjiang Yang, Bin Liu, and Jianhua Tao. 2025. Listen, Watch, and Learn to Feel: Retrieval-Augmented Emotion Reasoning for Compound Emotion Gen- eration. InFindings of the Association for Computational Linguistics: ACL 2025

  51. [51]

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, and Chi Wang

  52. [52]

    AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation.arXiv preprint arXiv:2308.08155(2023)

  53. [53]

    Peng Xia, Guilin Qi, et al . 2024. Multimodal Retrieval-Augmented Generation: A Survey.arXiv preprint arXiv:2402.03573(2024)

  54. [54]

    Lei Xin, Caiyun Huang, Hao Li, Shihong Huang, Yuling Feng, Zhenglun Kong, Zicheng Liu, Siyuan Li, Chang Yu, Fei Shen, and Hao Tang

  55. [55]

    arXiv:2412.12668 [q-bio.GN] https: //arxiv.org/abs/2412.12668

    Artificial Intelligence for Central Dogma-Centric Multi-Omics: Challenges and Breakthroughs. arXiv:2412.12668 [q-bio.GN] https: //arxiv.org/abs/2412.12668

  56. [56]

    Lei Xin, Yuhao Zheng, Ke Cheng, Changjiang Jiang, Zifan Zhang, and Fanhu Zeng. 2026. HyTRec: A Hybrid Temporal-Aware Atten- tion Architecture for Long Behavior Sequential Recommendation. arXiv:2602.18283 [cs.IR] https://arxiv.org/abs/2602.18283

  57. [57]

    Ye, and B

    Qu Yang, M. Ye, and B. Du. 2024. EmoLLM: Multimodal Emo- tional Understanding Meets Large Language Models.arXiv preprint arXiv:2406.16442(2024)

  58. [58]

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. InProceedings of the 11th International Conference on Learning Representations (ICLR)

  59. [59]

    Michihiro Yasunaga, Armen Aghajanyan, Weijia Shi, Richard James, Jure Leskovec, Percy Liang, Mike Lewis, Luke Zettlemoyer, and Wen- tau Yih. 2023. Retrieval-Augmented Multimodal Language Modeling. InProceedings of the 40th International Conference on Machine Learning (ICML). 39755–39769

  60. [60]

    Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexan- dre Bayen, and Yi Wu. 2022. The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games. InAdvances in Neural Information Processing Systems (NeurIPS), Vol. 35. 24611–24624

  61. [61]

    Zheng Yuan, Haotian Liu, Chunyuan Li, et al. 2025. A Survey on Multi- modal Retrieval-Augmented Generation.arXiv preprint arXiv:2504.08748 (2025)

  62. [62]

    Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis- Philippe Morency. 2017. Tensor fusion network for multimodal sen- timent analysis. InProceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 1103–1114

  63. [63]

    AmirAli Bagher Zadeh, Paul Pu Liang, Jonathan Vanbriesen, Soujanya Poria, Edmund Tong, Erik Cambria, Minghai Chen, and Louis-Philippe Morency. 2018. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics. 2236–2246

  64. [64]

    Hang Zhang, Xin Li, and Lidong Bing. 2023. Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understand- ing. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP): System Demonstrations. 443–453

  65. [65]

    Sitao Zhang, Yimu Pan, and James Z. Wang. 2023. Learning emotion rep- resentations from verbal and nonverbal communication. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18993–19004

  66. [66]

    Dongming Zhao, Weian Liu, Haisong Zhang, Zhengyuan Wei, Sihuang Zhao, Mingwei Xu, Jun Yin, Jinpeng Lu, and Yu Qiao. 2023. Chat- bridge: Bridging modalities to large language models via object-centric representations.arXiv preprint arXiv:2305.13101(2023)

  67. [67]

    Zengqun Zhao, Qingshan Liu, and Feng Zhou. 2022. Robust facial expression recognition: A survey.IEEE Transactions on Affective Com- puting13, 4 (2022), 1805–1823

  68. [68]

    Ziwang Zheng, Tangli Jiao, Zebang Cheng, et al. 2023. EmoLLaVA: A Large Vision-Language Model for Emotion Recognition.arXiv preprint arXiv:2312.14415(2023)

  69. [69]

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elho- seiny. 2024. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. InInternational Conference on Learning Representations

  70. [70]

    Yijie Zhu, Jie He, Rui Shao, Kaishen Yuan, Tao Tan, Xiaochen Yuan, and Zitong Yu. 2026. ΔVLA: Prior-Guided Vision-Language-Action Models via World Knowledge Variation.arXiv preprint arXiv:2603.08361(2026)

  71. [71]

    Yijie Zhu, Yibo Lyu, Zitong Yu, Rui Shao, Kaiyang Zhou, and Liqiang Nie. 2025. EmoSym: A Symbiotic Framework for Unified Emotional Understanding and Generation via Latent Reasoning. InProceedings of the 33nd ACM International Conference on Multimedia

  72. [72]

    Yijie Zhu, Rui Shao, Ziyang Liu, Jie He, Jizhihui Liu, Jiuru Wang, and Zitong Yu. 2026. H-GAR: A Hierarchical Interaction Framework via Goal-Driven Observation-Action Refinement for Robotic Manipulation. InProceedings of the AAAI Conference on Artificial Intelligence

  73. [73]

    Yijie Zhu, Lingsen Zhang, Zitong Yu, Rui Shao, Tao Tan, and Liqiang Nie. 2025. UniEmo: Unifying Emotional Understanding and Generation with Learnable Expert Queries.arXiv preprint arXiv:2507.23372(2025)