A Multi-Agent Framework with Structured Reasoning and Reflective Refinement for Multimodal Empathetic Response Generation
Pith reviewed 2026-05-10 03:27 UTC · model grok-4.3
The pith
Multi-agent framework with structured reasoning and reflection generates superior empathetic responses from multimodal inputs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present a closed-loop multi-agent framework for multimodal empathetic response generation. A structured empathetic reasoning-to-generation module decomposes the task into multimodal perception, consistency-aware emotion forecasting, pragmatic strategy planning, and strategy-guided response generation; a global reflection and refinement module then performs step-wise auditing over the intermediate states and the response, catching emotional biases and empathy errors and triggering iterative refinement.
What carries the argument
A multi-agent framework consisting of a structured empathetic reasoning-to-generation module that provides an explicit path from multimodal evidence to response, and a global reflection agent that audits intermediate states and triggers targeted regeneration.
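To make the closed-loop shape concrete, here is a minimal control-flow sketch. The agent names (perceiver, forecaster, planner, generator), the reflector verdict format, and the regeneration policy are illustrative assumptions; the abstract does not specify the paper's actual prompts, models, or interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Ordered stages of the structured reasoning-to-generation module.
STAGES = ["percepts", "forecast", "strategy", "response"]

@dataclass
class EmpathyState:
    context: dict                    # multimodal inputs (text / audio / vision)
    percepts: Optional[dict] = None  # stage 1: multimodal perception
    forecast: Optional[dict] = None  # stage 2: consistency-aware emotion forecast
    strategy: Optional[str] = None   # stage 3: pragmatic strategy plan
    response: Optional[str] = None   # stage 4: strategy-guided response
    audit_log: list = field(default_factory=list)

def run_pipeline(state: EmpathyState, agents: dict) -> EmpathyState:
    """One pass through the reasoning-to-generation module; stages that
    already hold a value (i.e., were not flagged by the auditor) are reused."""
    if state.percepts is None:
        state.percepts = agents["perceiver"](state.context)
    if state.forecast is None:
        state.forecast = agents["forecaster"](state.percepts)
    if state.strategy is None:
        state.strategy = agents["planner"](state.percepts, state.forecast)
    if state.response is None:
        state.response = agents["generator"](state.context, state.strategy)
    return state

def closed_loop(context: dict, agents: dict, reflector: Callable,
                max_iters: int = 3) -> EmpathyState:
    """Closed loop: the reflection agent audits every intermediate state plus
    the response; on rejection, the earliest faulty stage and everything
    downstream of it are cleared and regenerated."""
    state = EmpathyState(context=context)
    for _ in range(max_iters):
        state = run_pipeline(state, agents)
        # Assumed verdict shape, e.g. {"accept": False, "faulty_stages": ["forecast"]}
        verdict = reflector(state)
        state.audit_log.append(verdict)
        if verdict["accept"]:
            break
        first_faulty = min(STAGES.index(s) for s in verdict["faulty_stages"])
        for stage in STAGES[first_faulty:]:
            setattr(state, stage, None)
    return state
```

The property the sketch preserves is targeted regeneration: only the stage audited as faulty and its downstream dependents are recomputed, rather than restarting the whole pass from scratch.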
If this is right
- The model demonstrates superior empathic response generation capabilities on benchmarks such as IEMOCAP and MELD compared to state-of-the-art methods.
- Emotional biases are systematically eliminated through the closed-loop iteration process.
- The hierarchical progression of emotion perception is explicitly modeled, reducing distorted emotional judgments.
- Targeted regeneration based on reflection improves overall empathy accuracy.
Where Pith is reading between the lines
- This method may extend to other ambiguity-laden generation tasks, such as sarcasm-aware dialogue or personalized advice.
- Real-world deployment in chatbots could be tested by measuring user-perceived empathy in live interactions.
- The reflection module might be adapted to single large language model setups for self-correction without multiple agents.
- Combining this with more advanced multimodal encoders could further boost performance on diverse inputs.
Load-bearing premise
The one-pass generation paradigm overlooks the hierarchical progression of emotion perception and introduces significant emotional biases that a closed-loop multi-agent structure can eliminate.
What would settle it
If ablation studies or controlled comparisons on IEMOCAP and MELD show that removing the reflection module or the structured decomposition does not reduce empathy performance metrics, the claimed advantage of the closed-loop framework would be falsified.
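This falsification test is straightforward to operationalize. Below is a minimal sketch, assuming per-dialogue empathy scores are available for the full framework and for each ablation; the paired t-test mirrors the significance testing the (simulated) rebuttal mentions, and all names are hypothetical.

```python
import numpy as np
from scipy import stats

def module_contribution(full_scores, ablated_scores, alpha=0.05):
    """Paired, one-sided comparison of per-dialogue empathy scores between
    the full framework and an ablation (e.g., reflection module removed).
    If the ablation is NOT significantly worse, the removed module's
    claimed contribution is unsupported on that benchmark."""
    full = np.asarray(full_scores, dtype=float)
    ablated = np.asarray(ablated_scores, dtype=float)
    # H1: the full model scores strictly higher than the ablation.
    t_stat, p_value = stats.ttest_rel(full, ablated, alternative="greater")
    return {
        "mean_delta": float(full.mean() - ablated.mean()),
        "t": float(t_stat),
        "p": float(p_value),
        "module_matters": bool(p_value < alpha),
    }

# Hypothetical usage: one call per ablation per benchmark (IEMOCAP, MELD), e.g.
# module_contribution(full_iemocap, no_reflection_iemocap)
# module_contribution(full_iemocap, no_decomposition_iemocap)
```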
Original abstract
Multimodal empathetic response generation (MERG) aims to generate emotionally engaging and empathetic responses based on users' multimodal contexts. Existing approaches usually rely on an implicit one-pass generation paradigm from multimodal context to the final response, which overlooks two intrinsic characteristics of MERG: (1) Human perception of emotional cues is inherently structured rather than a direct mapping. The conventional paradigm neglects the hierarchical progression of emotion perception, leading to distorted emotional judgments. (2) Given the inherent complexity and ambiguity of human emotions, the conventional paradigm is prone to significant emotional biases, ultimately resulting in suboptimal empathy. In this paper, we propose a multi-agent framework for MERG, which enhances empathy through structured reasoning and reflective refinement. Specifically, we first introduce a structured empathetic reasoning-to-generation module that explicitly decomposes response generation via multimodal perception, consistency-aware emotion forecasting, pragmatic strategy planning, and strategy-guided response generation, providing a clearer intermediate path from multimodal evidence to response realization. Besides, we develop a global reflection and refinement module, in which a global reflection agent performs step-wise auditing over intermediate states and the generated response, eliminating existing emotional biases and empathy errors, and triggering targeted regeneration. Overall, such a closed-loop framework enables our model to gradually improve the accuracy of emotion perception and eliminate emotion biases during the iteration process. Experiments on several benchmarks, e.g., IEMOCAP and MELD, demonstrate that our model has superior empathic response generation capabilities compared to state-of-the-art methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a multi-agent framework for multimodal empathetic response generation (MERG) that decomposes the task into structured reasoning steps—multimodal perception, consistency-aware emotion forecasting, pragmatic strategy planning, and strategy-guided response generation—augmented by a global reflection and refinement module that performs step-wise auditing to eliminate emotional biases and empathy errors in a closed-loop process. It claims this overcomes limitations of conventional one-pass generation paradigms and yields superior empathic response capabilities on benchmarks such as IEMOCAP and MELD.
Significance. If the empirical results hold under standard controls, the work provides a concrete engineering contribution to affective computing by making emotion reasoning explicit and auditable, which could improve robustness in applications like conversational agents and mental-health support systems; the multi-agent decomposition with reflection is a reusable pattern that may generalize beyond MERG.
major comments (2)
- Abstract: the claim of 'superior empathic response generation capabilities' on IEMOCAP and MELD supplies no quantitative deltas, ablation results, or statistical tests, which is load-bearing for the central empirical contribution; the experiments section must include these with controls for model size and training data to substantiate the gains.
- Method (structured empathetic reasoning-to-generation module): the consistency-aware emotion forecasting step is described at a high level without specifying the consistency metric or how it interacts with prior emotion-labeling models; this risks circularity if the forecasting simply re-uses outputs from external classifiers without independent validation.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below and have revised the manuscript to incorporate the suggested improvements.
Point-by-point responses
- Referee: Abstract: the claim of 'superior empathic response generation capabilities' on IEMOCAP and MELD supplies no quantitative deltas, ablation results, or statistical tests, which is load-bearing for the central empirical contribution; the experiments section must include these with controls for model size and training data to substantiate the gains.
  Authors: We agree that quantitative support is essential for the central claims. In the revised manuscript, the experiments section now includes specific performance deltas (e.g., absolute and relative gains in empathy and consistency metrics on both IEMOCAP and MELD), full ablation studies for each module, and statistical significance tests (paired t-tests with p-values). We have also added explicit controls by reporting results against baselines matched for parameter count and training data volume. The abstract has been updated to reference the key quantitative improvements. Revision: yes.
- Referee: Method (structured empathetic reasoning-to-generation module): the consistency-aware emotion forecasting step is described at a high level without specifying the consistency metric or how it interacts with prior emotion-labeling models; this risks circularity if the forecasting simply re-uses outputs from external classifiers without independent validation.
  Authors: We appreciate this clarification request. The consistency-aware emotion forecasting step computes a consistency score via cosine similarity between the forecasted emotion embedding sequence and the multimodal perceptual features extracted in the preceding step; this score is produced by a dedicated lightweight consistency scorer trained jointly but evaluated independently against ground-truth emotion trajectories from the dataset. We have revised the method section to provide the exact formulation of the metric, its training objective, and the independent validation protocol that avoids direct reuse of external classifiers, thereby addressing the circularity concern. Revision: yes.
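Taking the rebuttal's description at face value (the rebuttal is itself simulated), the metric admits a very small sketch. The shared embedding space and the time-aligned shapes are assumptions; the paper may define the score differently.

```python
import numpy as np

def consistency_score(forecast_emb: np.ndarray, percept_emb: np.ndarray,
                      eps: float = 1e-8) -> float:
    """Mean per-step cosine similarity between the forecasted emotion
    embedding sequence and the multimodal perceptual features from the
    preceding stage. Both arrays are (T, d) and assumed to have been
    projected into a shared space upstream."""
    f = forecast_emb / (np.linalg.norm(forecast_emb, axis=-1, keepdims=True) + eps)
    p = percept_emb / (np.linalg.norm(percept_emb, axis=-1, keepdims=True) + eps)
    return float(np.mean(np.sum(f * p, axis=-1)))
```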
Circularity Check
No significant circularity in framework proposal or empirical claims
full rationale
The paper proposes a multi-agent framework with an explicit decomposition (perception → forecasting → planning → generation) plus a global reflection stage, offered as an engineering remedy for the limitations of one-pass generation. Neither the abstract nor the described structure contains a load-bearing mathematical derivation chain, fitted parameters renamed as predictions, or critical steps resting on self-citation. The central claims rest on benchmark experiments (IEMOCAP, MELD) showing performance gains, which are independent of the framework's internal definitions, so the approach does not reduce to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Human perception of emotional cues is inherently structured rather than a direct mapping.
- domain assumption The conventional one-pass paradigm is prone to significant emotional biases.
Reference graph
Works this paper leans on
- [1] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N. Chang, Sungbok Lee, and Shrikanth S. Narayanan. 2008. IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation 42, 4 (2008), 335–359. doi:10.1007/s10579-008-9076-6
- [3] Feiyu Chen, Jie Shao, Shuyuan Zhu, and Heng Tao Shen. 2023. Multivariate, Multi-Frequency and Multimodal: Rethinking Graph Neural Networks for Emotion Recognition in Conversation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 10761–10770. https://openaccess.thecvf.com/content/CVPR2023/html/Chen_Multivariate_...
- [4] Yi Cheng, Wenge Liu, Wenjie Li, Jiashuo Wang, Ruihui Zhao, Bang Liu, Xiaodan Liang, and Yefeng Zheng. 2022. Improving Multi-turn Emotional Support Dialogue Generation with Lookahead Strategy Planning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Computational Linguistics.
- [5] Yang Deng, Wenxuan Zhang, Yifei Yuan, and Wai Lam. 2023. Knowledge-enhanced Mixed-initiative Dialogue System for Emotional Support Conversations. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Toronto, Canada. doi:10.18653/v1/2023.acl-long.225
- [6] Hao Fei et al. 2024. EmpathyEar: An Open-source Avatar Multimodal Empathetic Chatbot. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations.
- [7] Devamanyu Hazarika, Roger Zimmermann, and Soujanya Poria. 2020. MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis. In Proceedings of the 28th ACM International Conference on Multimedia. 1122–1131.
- [8] Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. 2024. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=VtmBAGCN7o
- [10] Jingwen Hu, Yuchen Liu, Jinming Zhao, and Qin Jin. 2021. MMGCN: Multimodal Fusion via Deep Graph Convolution Network for Emotion Recognition in Conversation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics.
- [11] Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. 2024. Large Language Models Cannot Self-Correct Reasoning Yet. In The Twelfth International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=IkmPE9X7vM
- [12] Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A Diversity-Promoting Objective Function for Neural Conversation Models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, San Diego, California.
- [13] Qintong Li, Hongshen Chen, Zhaochun Ren, Pengjie Ren, Zhaopeng Tu, and Zhumin Chen. 2020. EmpDG: Multi-resolution Interactive Empathetic Dialogue Generation. In Proceedings of the 28th International Conference on Computational Linguistics. International Committee on Computational Linguistics, Barcelona, Spain (Online), 4454–4466. https://aclanthology.org/2...
- [14] Qintong Li, Piji Li, Zhaochun Ren, Pengjie Ren, and Zhumin Chen. 2022. Knowledge Bridging for Empathetic Dialogue Generation. In Proceedings of the AAAI Conference on Artificial Intelligence. https://qtli.github.io/publication/kemp/
- [15] Yifan Lin et al. 2025. E3RG: Building Explicit Emotion-driven Empathetic Response Generation System with Multimodal Large Language Model. In Proceedings of the 33rd ACM International Conference on Multimedia.
- [16] Zhaojiang Lin, Andrea Madotto, Jamin Shin, Peng Xu, and Pascale Fung. 2019. MoEL: Mixture of Empathetic Listeners. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 12...
- [17] Siyang Liu, Chujie Zheng, Orianna Demasi, Sahand Sabour, Yu Li, Zhou Yu, Yong Jiang, and Minlie Huang. 2021. Towards Emotional Support Dialog Systems. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics. doi:10.18653/v1/2021.acl-long.269
- [19] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. 2023. Self-Refine: Iterative Refinement with Self-Feedback. In Advances in Neural Information Processing Systems.
- [21] Navonil Majumder, Soujanya Poria, Devamanyu Hazarika, Rada Mihalcea, Alexander F. Gelbukh, and Erik Cambria. 2019. DialogueRNN: An Attentive RNN for Emotion Detection in Conversations. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 6818–6825. doi:10.1609/aaai.v33i01.33016818
- [22] Yuzhao Mao, Guang Liu, Xiaojie Wang, Weiguo Gao, and Xuan Li. 2021. DialogueTRM: Exploring Multi-Modal Emotional Dynamics in a Conversation. In Findings of the Association for Computational Linguistics: EMNLP 2021. Association for Computational Linguistics, Punta Cana, Dominican Republic, 2694–2704. doi:10.18653/v1/2021.findings-emnlp.229
- [23] Ollama. 2026. Ollama Documentation. https://docs.ollama.com. Accessed: 2026-03-28.
- [24] Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. 2019. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 527–536. doi:10.186...
- [25] Qwen Team. 2026. Qwen/Qwen3.5-27B. https://huggingface.co/Qwen/Qwen3.5-27B. Official model card. Accessed: 2026-03-28.
- [26] Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. 2019. Towards Empathetic Open-domain Conversation Models: A New Benchmark and Dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 5370–5381. https://aclanthology.org/P19-1534/
- [27] Sahand Sabour, Chujie Zheng, and Minlie Huang. 2022. CEM: Commonsense-Aware Empathetic Response Generation. Proceedings of the AAAI Conference on Artificial Intelligence 36, 10 (2022), 11229–11237. doi:10.1609/aaai.v36i10.21373
- [28] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik R. Narasimhan, and Shunyu Yao. 2023. Reflexion: Language Agents with Verbal Reinforcement Learning. In Advances in Neural Information Processing Systems, Vol. 36. https://openreview.net/forum?id=vAElhFcKW6
- [29] Geng Tu, Feng Xiong, Bin Liang, Hui Wang, Xi Zeng, and Ruifeng Xu. 2024. Multimodal Emotion Recognition Calibration in Conversations. In Proceedings of the 32nd ACM International Conference on Multimedia. Association for Computing Machinery, Melbourne, VIC, Australia, 9621–9630. doi:10.1145/3664647.3681515
- [30] Chenwei Wan, Matthieu Labeau, and Chloé Clavel. 2025. EmoDynamiX: Emotional Support Dialogue Strategy Prediction by Modelling MiXed Emotions and Discourse Dynamics. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Association for Computational Linguistics.
- [31] Lanrui Wang, Jiangnan Li, Zheng Lin, Fandong Meng, Chenxu Yang, Weiping Wang, and Jie Zhou. 2022. Empathetic Dialogue Generation via Sensitive Emotion Recognition and Sensible Knowledge Selection. In Findings of the Association for Computational Linguistics: EMNLP 2022. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 4634–4645.
- [32] Jiaqiang Wu, Xuandong Huang, Zhouan Zhu, and Shangfei Wang. 2025. From Traits to Empathy: Personality-Aware Multimodal Empathetic Response Generation. In Proceedings of the 31st International Conference on Computational Linguistics. Association for Computational Linguistics, Abu Dhabi, UAE, 8925–... https://aclanthology.org/2025.coling-main.598/
- [34] Jiaqiang Wu, Shangfei Wang, Yanan Chang, and Zhouan Zhu. 2025. Empathetic Response Generation Through Multi-modality. IEEE Transactions on Affective Computing (2025). Early access. doi:10.1109/TAFFC.2025.3599869
- [35] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, and Chi Wang. 2024. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. In First Conference on Language Modeling. https://www.microsoft.com/en-us/research/publ...
- [36] Yangyang Xu, Jinpeng Hu, Zhuoer Zhao, Zhangling Duan, Xiao Sun, and Xun Yang. 2025. MultiAgentESC: A LLM-based Multi-Agent Collaboration Framework for Emotional Support Conversation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Suzhou, China, 4665–4681. doi:10.18653/v1...
- [37] Zhou Yang, Zhaochun Ren, Wang Yufeng, Haizhou Sun, Chao Chen, Xiaofei Zhu, and Xiangwen Liao. 2024. An Iterative Associative Memory Model for Empathetic Response Generation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Bangkok, Thailand, 3081–3...
- [38] Dong Zhang, Weisheng Zhang, Shoushan Li, Qiaoming Zhu, and Guodong Zhou. 2020. Modeling both Intra- and Inter-modal Influence for Real-Time Emotion Detection in Conversations. In Proceedings of the 28th ACM International Conference on Multimedia. Association for Computing Machinery, 503–511. doi:10.1145/3394171.3413949
- [39] Han Zhang et al. 2025. The ACM Multimedia 2025 Grand Challenge of Avatar-based Multimodal Empathetic Response Generation. In Proceedings of the 33rd ACM International Conference on Multimedia.
- [41] Han Zhang, Zixiang Meng, Meng Luo, Hong Han, Lizi Liao, Erik Cambria, and Hao Fei. 2025. Towards Multimodal Empathetic Response Generation: A Rich Text-Speech-Vision Avatar-based Benchmark. In Proceedings of the ACM Web Conference 2025 (WWW '25). 2872–2881. doi:10.1145/3696410.3714739
- [42] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating Text Generation with BERT. In International Conference on Learning Representations. https://openreview.net/forum?id=SkeHuCVFDr
- [44] Yiqun Zhang, Fanheng Kong, Peidong Wang, Shuang Sun, Lingshuai Wang, Shi Feng, Daling Wang, Yifei Zhang, and Kaisong Song. 2024. STICKERCONV: Generating Multimodal Empathetic Responses from Scratch. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics.
- [45] Weixiang Zhao, Yanyan Zhao, Xin Lu, and Bing Qin. 2023. Don't Lose Yourself! Empathetic Response Generation via Explicit Self-Other Awareness. In Findings of the Association for Computational Linguistics: ACL 2023. Association for Computational Linguistics, Toronto, Canada, 13331–13344. https://aclanthology.org/2023.findings-acl.843/
- [46] Jinfeng Zhou, Chujie Zheng, Bo Wang, Zheng Zhang, and Minlie Huang. 2023. CASE: Aligning Coarse-to-Fine Cognition and Affection for Empathetic Response Generation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Toronto, Canada, 8223–8237. doi:1...
discussion (0)