Sympatheia: Emotionally Adaptive Voice Assistant with Continuous Affect Conditioning

Nima Mesgarani; Riki Shimizu; Sukru Samet Dindar; Xilin Jiang

arxiv: 2606.00851 · v1 · pith:GCQVRYFDnew · submitted 2026-05-30 · 💻 cs.SD · cs.CL· cs.HC· cs.LG· eess.AS

Sympatheia: Emotionally Adaptive Voice Assistant with Continuous Affect Conditioning

Sukru Samet Dindar , Riki Shimizu , Xilin Jiang , Nima Mesgarani This is my paper

Pith reviewed 2026-06-28 17:54 UTC · model grok-4.3

classification 💻 cs.SD cs.CLcs.HCcs.LGeess.AS

keywords Sympatheiaemotionally adaptive voice assistantcontinuous affect conditioningvalence-arousal controlspeech-to-speech dialoguemultimodal emotion sensingsynthetic dialogue corpus

0 comments

The pith

Sympatheia conditions voice assistant responses on continuous valence-arousal signals to ensure emotional appropriateness even with weak speech cues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a speech-to-speech system can produce emotionally fitting replies by taking an explicit continuous valence-arousal input in addition to the spoken query. This matters for cases where the user's speech does not clearly convey emotion, as the signal can come from other sources like facial analysis or user input. The authors train the model on a new synthetic corpus that separates neutral queries from multiple emotional responses. Experiments demonstrate superior performance over baselines in matching both content and delivery to the target emotion. The same conditioning mechanism supports combining emotion data from various sensing methods.

Core claim

Sympatheia is a speech-to-speech dialogue framework conditioned on affect from the user's speech and, when available, an explicit continuous valence-arousal control signal. It is trained on the Sympatheia-18k corpus featuring 12 emotion anchors with emotional and neutral splits. The system generates responses whose semantic content and spoken delivery are emotionally appropriate, outperforming speech conversational baselines, and integrates emotion estimates from facial expression, biosignals, and textual descriptions to improve alignment when speech is limited.

What carries the argument

The continuous valence-arousal (VA) control signal used to condition the speech generation for emotional adaptation.

If this is right

The generated responses align emotionally in both semantics and acoustics.
Multimodal inputs enhance performance when speech affect is ambiguous.
The VA interface allows flexible integration of different emotion sensing modules.
Neutral query splits isolate the effect of explicit control on response choice.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Direct user specification of desired emotional tone could become a standard control in voice interfaces.
The approach may extend naturally to text or visual output generation for consistent affect across modalities.

Load-bearing premise

Synthetic data from the 18k corpus transfers its emotional and acoustic properties to real-world user speech for effective conditioning.

What would settle it

A direct comparison of human ratings for emotional fit on outputs from the conditioned model versus baselines when tested on real ambiguous speech recordings.

Figures

Figures reproduced from arXiv: 2606.00851 by Nima Mesgarani, Riki Shimizu, Sukru Samet Dindar, Xilin Jiang.

**Figure 1.** Figure 1: Motivation for SYMPATHEIA. The same user utterance may call for different responses depending on the speaker’s emotional state, illustrated in a valence-arousal space. The system aims to generate spoken responses that are both semantically appropriate and emotionally aligned. Prior work on empathetic and emotion-aware dialogue has made important progress in generating more contextually appropriate, support… view at source ↗

**Figure 2.** Figure 2: System overview of SYMPATHEIA. The core speech-to-speech dialogue model uses affective cues in the user’s spoken query directly and can additionally receive explicit valence–arousal conditioning from pluggable emotion sensing modules, such as facial expression, EEG/physiological signals, text descriptions, or direct interface selection. These optional external estimates are converted into a shared VA inter… view at source ↗

**Figure 3.** Figure 3: The architecture of the SYMPATHEIA speech-to-speech model. Input audio is encoded into discrete speech tokens, while optional emotion recognition modules map facial expression, EEG/physiological signals, or user-provided affect descriptions into continuous valence–arousal (VA) values. The VA values and user speech tokens are fed to the language model, which generates affect-conditioned response speech toke… view at source ↗

**Figure 4.** Figure 4: Screenshot of the survey interface used for the human Emotion MOS evaluation. [PITH_FULL_IMAGE:figures/full_fig_p020_4.png] view at source ↗

read the original abstract

Empathetic spoken dialogue systems must infer a user's emotional state to respond appropriately, yet everyday speech often carries weak, neutral, or ambiguous affective cues. To address this, we introduce Sympatheia, a speech-to-speech dialogue framework conditioned on affect inferred from the user's speech and, when available, explicit affect specifications provided as a continuous valence--arousal (VA) control signal by a multimodal sensing module or user interface. To train our model, we construct Sympatheia-18k, an emotion-conditioned synthetic spoken dialogue corpus with 12 emotion anchors. This dataset includes an emotional split for learning affective speech behavior, and a neutral split that pairs emotionally neutral queries with multiple emotion-conditioned responses to isolate explicit emotion control in emotionally ambiguous cases. Empirical results show that Sympatheia outperforms speech conversational baselines in generating responses whose semantic content and spoken delivery are both emotionally appropriate. We further show that the same VA interface can integrate emotion estimates from diverse sensing modules, including facial expression, biosignals, and textual affect descriptions, improving response alignment when speech alone provides limited emotional evidence. These results suggest that continuous affect conditioning is an effective practical step for building emotionally adaptive voice assistants.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Sympatheia adds continuous VA conditioning and a new synthetic 18k corpus with neutral/emotional splits, but the transfer from synthetic to real speech is the unaddressed gap.

read the letter

The main contribution is a speech-to-speech framework that conditions responses on continuous valence-arousal signals, either inferred from speech or supplied explicitly from other sensors. They built Sympatheia-18k around 12 emotion anchors, with an emotional split for affective behavior and a neutral split that pairs ambiguous queries with multiple conditioned responses. This split is a straightforward way to force the model to use the external control signal rather than defaulting to input speech.

The approach makes sense for the stated problem: everyday voice queries are often neutral or weak on affect, so allowing multimodal inputs improves alignment. The neutral-split design is a practical engineering choice that directly targets the ambiguous case.

The soft spot is the data. The synthetic corpus is generated, and nothing in the description shows that its acoustic or emotional distributions match real user speech in prosody, timbre, or intensity. Results on held-out synthetic test cases therefore do not establish the practical claim for actual voice-assistant use. The abstract asserts outperformance on semantic content and spoken delivery but supplies no metrics, baseline names, or ablation numbers, so the size of the gain cannot be judged. No code or data release is mentioned.

This is for people building spoken dialogue systems who want an explicit affect knob. A reader already working on multimodal conditioning could borrow the split idea or the interface pattern.

Send it to peer review. The engineering pattern is clear enough to be worth checking whether the synthetic results survive real-speech validation or at least detailed ablations.

Referee Report

2 major / 1 minor

Summary. The paper introduces Sympatheia, a speech-to-speech dialogue framework conditioned on continuous valence-arousal (VA) affect signals inferred from user speech and, when available, explicit multimodal or UI-provided VA controls. It constructs the Sympatheia-18k synthetic corpus using 12 emotion anchors, with an emotional split for affective behavior and a neutral split pairing neutral queries with multiple emotion-conditioned responses. The central empirical claim is that the resulting model outperforms speech conversational baselines in producing responses that are emotionally appropriate in both semantic content and spoken delivery, and that the VA interface enables effective integration of emotion estimates from facial, biosignal, and textual sources to improve alignment when speech cues are weak or ambiguous.

Significance. If the generalization claims hold, the work offers a concrete, practical mechanism for explicit continuous affect control in voice assistants via synthetic data with designed splits, which could address a common limitation in empathetic dialogue systems. The multimodal VA integration is a clear strength for real-world deployment where speech alone is often insufficient. The synthetic corpus approach with neutral/emotional splits is a methodological contribution worth noting if accompanied by evidence of transfer.

major comments (2)

[Empirical evaluation / results] The central empirical claim (abstract and results) that Sympatheia outperforms baselines in generating emotionally appropriate responses rests on evaluation using held-out portions of the synthetic Sympatheia-18k corpus. No cross-domain evaluation on real user speech is described, leaving the practical claim for voice-assistant scenarios where speech is ambiguous unsupported.
[Dataset construction] Dataset section: The Sympatheia-18k corpus is generated with 12 emotion anchors and TTS-based synthesis for the emotional/neutral splits. The manuscript provides no analysis or ablation addressing potential systematic differences in prosody, timbre, or affect intensity between the synthetic data and natural speech, which directly affects whether the learned continuous VA conditioning transfers to the target use case.

minor comments (1)

[Abstract] The abstract states outperformance without naming the specific metrics (e.g., semantic similarity, prosody measures, or human ratings) or the exact baselines; adding these would improve clarity even if full details appear later.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate the revisions we will make to the manuscript.

read point-by-point responses

Referee: [Empirical evaluation / results] The central empirical claim (abstract and results) that Sympatheia outperforms baselines in generating emotionally appropriate responses rests on evaluation using held-out portions of the synthetic Sympatheia-18k corpus. No cross-domain evaluation on real user speech is described, leaving the practical claim for voice-assistant scenarios where speech is ambiguous unsupported.

Authors: We agree that the reported results are obtained on held-out portions of the synthetic corpus. This evaluation was chosen to isolate the effects of the emotional and neutral splits under controlled conditions. We acknowledge that this leaves the generalization to real user speech untested and weakens the practical claims for ambiguous voice-assistant scenarios. We will revise the manuscript by adding a limitations subsection that explicitly qualifies the empirical claims and outlines the need for future cross-domain validation on natural speech data. revision: yes
Referee: [Dataset construction] Dataset section: The Sympatheia-18k corpus is generated with 12 emotion anchors and TTS-based synthesis for the emotional/neutral splits. The manuscript provides no analysis or ablation addressing potential systematic differences in prosody, timbre, or affect intensity between the synthetic data and natural speech, which directly affects whether the learned continuous VA conditioning transfers to the target use case.

Authors: The referee is correct that the manuscript contains no ablation or quantitative comparison of prosody, timbre, or affect intensity between the TTS-generated data and natural speech. Our dataset design emphasizes controllability via the 12 anchors and the neutral/emotional split structure. We will revise the dataset section to include a short discussion of known TTS-to-natural domain gaps, supported by references to prior work on synthetic speech transfer, and will add an explicit statement that transfer of the VA conditioning remains to be verified on natural recordings. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical ML system paper with no derivation chain

full rationale

The manuscript describes construction of a synthetic corpus (Sympatheia-18k with 12 emotion anchors and neutral/emotional splits), training of a speech-to-speech model conditioned on continuous valence-arousal, and empirical evaluation against baselines. No equations, first-principles derivations, or mathematical predictions appear in the provided text. Central claims rest on measured outperformance and multimodal integration rather than any reduction of a 'prediction' to fitted inputs or self-citation chains. No self-definitional, fitted-input, or uniqueness-theorem patterns are present. This is a standard empirical systems contribution whose validity hinges on data transfer assumptions, not circular logic.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no mathematical derivations, model equations, or training objectives; therefore no free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.1-grok · 5762 in / 1140 out tokens · 25488 ms · 2026-06-28T17:54:37.833830+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

54 extracted references · 20 canonical work pages · 9 internal anchors

[1]

Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. InAdvances in Neural Information Processing Systems, 2020

2020
[2]

Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. InAdvances in Neural Information Processing Systems, 2022

2022
[3]

Towards empathetic open-domain conversation models: A new benchmark and dataset

Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. Towards empathetic open-domain conversation models: A new benchmark and dataset. InProceedings of ACL, 2019

2019
[4]

Klaus R. Scherer. V ocal communication of emotion: A review of research paradigms.Speech Communication, 40(1–2):227–256, 2003

2003
[5]

Roisman, and Thomas S

Zhihong Zeng, Maja Pantic, Glenn I. Roisman, and Thomas S. Huang. A survey of affect recognition methods: Audio, visual, and spontaneous expressions.IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(1):39–58, 2009

2009
[6]

James A. Russell. A circumplex model of affect.Journal of Personality and Social Psychology, 39(6):1161–1178, 1980

1980
[7]

Chang, Sungbok Lee, and Shrikanth S

Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N. Chang, Sungbok Lee, and Shrikanth S. Narayanan. IEMOCAP: Interactive emotional dyadic motion capture database.Language Resources and Evaluation, 42(4):335– 359, 2008

2008
[8]

MELD: A multimodal multi-party dataset for emotion recognition in conversa- tions

Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. MELD: A multimodal multi-party dataset for emotion recognition in conversa- tions. InProceedings of ACL, 2019

2019
[9]

MoEL: Mixture of empathetic listeners

Zhaojiang Lin, Andrea Madotto, Jamin Shin, Peng Xu, and Pascale Fung. MoEL: Mixture of empathetic listeners. InProceedings of EMNLP-IJCNLP, 2019

2019
[10]

MIME: MIMicking emotions for empathetic response generation

Navonil Majumder, Pengfei Hong, Shanshan Peng, Jiankun Lu, Deepanway Ghosal, Alexander Gelbukh, Rada Mihalcea, and Soujanya Poria. MIME: MIMicking emotions for empathetic response generation. InProceedings of EMNLP, 2020

2020
[11]

Reza Lotfian and Carlos Busso. Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings.IEEE Transactions on Affective Computing, 10(4):471–483, 2019. doi: 10.1109/TAFFC.2017.2736999. 10

work page doi:10.1109/taffc.2017.2736999 2019
[12]

Salman, Wei-Cheng Lin, Lucas Goncalves, Srinivas Parthasarathy, Abinay Reddy Naini, Seong-Gyun Leem, Luz Martinez-Lucas, Huang- Cheng Chou, and Pravin Mote

Carlos Busso, Reza Lotfian, Kusha Sridhar, Ali N. Salman, Wei-Cheng Lin, Lucas Goncalves, Srinivas Parthasarathy, Abinay Reddy Naini, Seong-Gyun Leem, Luz Martinez-Lucas, Huang- Cheng Chou, and Pravin Mote. The MSP-Podcast corpus.arXiv preprint arXiv:2509.09791, 2025

work page arXiv 2025
[13]

Ali Mollahosseini, Behzad Hasani, and Mohammad H. Mahoor. AffectNet: A database for facial expression, valence, and arousal computing in the wild.IEEE Transactions on Affective Computing, 10(1):18–31, 2019

2019
[14]

DEAP: A database for emotion analysis using physiological signals.IEEE Transactions on Affective Computing, 3(1): 18–31, 2012

Sander Koelstra, Christian Mühl, Mohammad Soleymani, Jong-Seok Lee, Ashkan Yazdani, Touradj Ebrahimi, Thierry Pun, Anton Nijholt, and Ioannis Patras. DEAP: A database for emotion analysis using physiological signals.IEEE Transactions on Affective Computing, 3(1): 18–31, 2012

2012
[15]

SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities,

Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities.arXiv preprint arXiv:2305.11000, 2023

work page arXiv 2023
[16]

Moshi: a speech-text foundation model for real-time dialogue

Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. Moshi: A speech-text foundation model for real-time dialogue.arXiv preprint arXiv:2410.00037, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot

Aohan Zeng, Zhengxiao Du, Mingdao Liu, Kedong Wang, Shengmin Jiang, Lei Zhao, Yuxiao Dong, and Jie Tang. GLM-4-V oice: Towards intelligent and human-like end-to-end spoken chatbot.arXiv preprint arXiv:2412.02612, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

Qwen3-Omni Technical Report

Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, et al. Qwen3-Omni technical report.arXiv preprint arXiv:2509.17765, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Kimi-Audio Technical Report

KimiTeam, Ding Ding, Zeqian Ju, Yichong Leng, Songxiang Liu, Tong Liu, Zeyu Shang, Kai Shen, Wei Song, Xu Tan, et al. Kimi-Audio technical report.arXiv preprint arXiv:2504.18425, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

BLSP-Emo: Towards empathetic large speech-language models

Chen Wang, Minpeng Liao, Zhongqiang Huang, Junhong Wu, Chengqing Zong, and Jiajun Zhang. BLSP-Emo: Towards empathetic large speech-language models. InProceedings of EMNLP, 2024

2024
[21]

OpenS2S: Advancing fully open- source end-to-end empathetic large speech language model.arXiv preprint arXiv:2507.05177, 2025

Chen Wang, Tianyu Peng, Wen Yang, Yinan Bai, Guangfu Wang, Jun Lin, Lanpeng Jia, Lingxi- ang Wu, Jinqiao Wang, Chengqing Zong, and Jiajun Zhang. OpenS2S: Advancing fully open- source end-to-end empathetic large speech language model.arXiv preprint arXiv:2507.05177, 2025

work page arXiv 2025
[22]

Empathy Omni: Enabling empathetic speech response generation through large language models.arXiv preprint arXiv:2508.18655, 2025

Haoyu Wang, Guangyan Zhang, Jiale Chen, Jingyu Li, Yuehai Wang, and Yiwen Guo. Empathy Omni: Enabling empathetic speech response generation through large language models.arXiv preprint arXiv:2508.18655, 2025

work page arXiv 2025
[23]

OSUM-EChat: Enhancing end-to-end empathetic spoken chatbot via understanding-driven spoken dialogue.arXiv preprint arXiv:2508.09600, 2025

Xuelong Geng, Qijie Shao, Hongfei Xue, Shuiyuan Wang, Hanke Xie, Zhao Guo, Yi Zhao, Guojian Li, Wenjie Tian, Chengyou Wang, Zhixian Zhao, Kangxiang Xia, Ziyu Zhang, Zhennan Lin, Tianlun Zuo, Mingchen Shao, Yuang Cao, Guobin Ma, Longhao Li, Yuhang Dai, Dehui Gao, Dake Guo, and Lei Xie. OSUM-EChat: Enhancing end-to-end empathetic spoken chatbot via understa...

work page arXiv 2025
[24]

Robust Speech Recognition via Large-Scale Weak Supervision

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision.arXiv preprint arXiv:2212.04356, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[25]

Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In Proceedings of ICLR, 2022. 11

2022
[26]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Qwen3-TTS Technical Report

Hangrui Hu, Xinfa Zhu, Ting He, Dake Guo, Bin Zhang, Xiong Wang, Zhifang Guo, Ziyue Jiang, Hongkun Hao, Zishan Guo, et al. Qwen3-TTS technical report.arXiv preprint arXiv:2601.15621, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[28]

Savchenko

Andrey V . Savchenko. HSEmotion: High-speed emotion recognition library.Software Impacts, 14:100433, 2022

2022
[29]

SEED-VII: A multimodal dataset of six basic emotions with continuous labels for emotion recognition.IEEE Transactions on Affective Computing, 16(2):969–985, 2025

Wei-Bang Jiang, Xuan-Hao Liu, Wei-Long Zheng, and Bao-Liang Lu. SEED-VII: A multimodal dataset of six basic emotions with continuous labels for emotion recognition.IEEE Transactions on Affective Computing, 16(2):969–985, 2025

2025
[30]

YAAD: Young adult’s affective data using wearable ECG and GSR sensors

Muhammad Najam Dar, Amna Rahim, Muhammad Usman Akram, Sajid Gul Khawaja, and Aqsa Rahim. YAAD: Young adult’s affective data using wearable ECG and GSR sensors. In2022 2nd International Conference on Digital Futures and Transformative Technologies (ICoDT2), pages 1–7. IEEE, 2022

2022
[31]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of CVPR, 2016

2016
[32]

Scherer and Harald G

Klaus R. Scherer and Harald G. Wallbott. Evidence for universality and cultural variation of differential emotion response patterning.Journal of Personality and Social Psychology, 66(2): 310–328, 1994

1994
[33]

Emotion English DistilRoBERTa-base

Jochen Hartmann. Emotion English DistilRoBERTa-base. https://huggingface.co/ j-hartmann/emotion-english-distilroberta-base/, 2022

2022
[34]

URO-Bench: Towards comprehensive evaluation for end-to-end spoken dialogue models

Ruiqi Yan, Xiquan Li, Wenxi Chen, Zhikang Niu, Chen Yang, Ziyang Ma, Kai Yu, and Xie Chen. URO-Bench: Towards comprehensive evaluation for end-to-end spoken dialogue models. InFindings of the Association for Computational Linguistics: EMNLP, 2025

2025
[35]

VoiceBench: Benchmarking LLM-Based Voice Assistants

Yiming Chen, Xianghu Yue, Chen Zhang, Xiaoxue Gao, Robby T. Tan, and Haizhou Li. V oiceBench: Benchmarking LLM-based voice assistants.arXiv preprint arXiv:2410.17196, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[36]

Prolific.ac–a subject pool for online experiments.Journal of Behavioral and Experimental Finance, 17:22–27, 2018

Stefan Palan and Christian Schitter. Prolific.ac–a subject pool for online experiments.Journal of Behavioral and Experimental Finance, 17:22–27, 2018. doi: 10.1016/j.jbef.2017.12.004

work page doi:10.1016/j.jbef.2017.12.004 2018
[37]

Weinberger, and Yoav Artzi

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. InProceedings of ICLR, 2020

2020
[38]

ROUGE: A package for automatic evaluation of summaries

Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. InText Summariza- tion Branches Out, pages 74–81, 2004

2004
[39]

Mini-Omni: Language models can hear, talk while thinking in streaming.arXiv preprint arXiv:2408.16725, 2024

Zhifei Xie and Changqiao Wu. Mini-Omni: Language models can hear, talk while thinking in streaming.arXiv preprint arXiv:2408.16725, 2024

work page arXiv 2024
[40]

UTMOS: UTokyo-SaruLab system for V oiceMOS challenge 2022

Takaaki Saeki, Detai Xin, Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, and Hiroshi Saruwatari. UTMOS: UTokyo-SaruLab system for V oiceMOS challenge 2022. In Proceedings of Interspeech, pages 4521–4525, 2022

2022
[41]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InProceedings of ICLR, 2019

2019
[42]

DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters

Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of KDD, 2020

2020
[43]

A dictionary of affect in language: I

Kevin Sweeney and Cynthia Whissell. A dictionary of affect in language: I. establishment and preliminary validation.Perceptual and Motor Skills, 59(3):695–698, 1984. doi: 10.2466/pms. 1984.59.3.695. 12

work page doi:10.2466/pms 1984
[44]

Harper & Row, New York, 1980

Robert Plutchik.Emotion: A Psychoevolutionary Synthesis. Harper & Row, New York, 1980

1980
[45]

Roddy Cowie, Ellen Douglas-Cowie, Nicolas Tsapatsoulis, George V otsis, Stefanos Kollias, Winfried Fellenz, and John G. Taylor. Emotion recognition in human-computer interaction. IEEE Signal Processing Magazine, 18(1):32–80, 2001. doi: 10.1109/79.911197

work page doi:10.1109/79.911197 2001
[46]

Sentence-bert: Sentence embeddings using siamese bert-networks

Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT- networks. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 3982–3992, 2019. doi: 10.18653/v1/D19-1410

work page doi:10.18653/v1/d19-1410 2019
[47]

MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers

Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. InAdvances in Neural Information Processing Systems, volume 33, pages 5776–5788, 2020

2020
[48]

Harris, K

Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fer- nández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin She...

2020
[49]

Qwen2.5-Omni Technical Report

Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. Qwen2.5-Omni technical report.arXiv preprint arXiv:2503.20215, 2025. A Societal Impact and Responsible Deployment SYMPATHEIAis intended to make spoken assistants more emotionally aware, supporti...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[50]

Acknowledges and addresses the user’s emotional state explicitly and warmly
[51]

Answers their actual question or addresses their request with real, useful content
[52]

Weaves the emotional support and the topic together
[53]

I don’t care how fancy this city looks on postcards — what’s the point of even trying to explore it if everything’s closed and no one knows where anything is?

A listener could tell WHAT EMOTION you’re responding to AND what the user asked about Output ONLY the response text -- no quotes, no explanation. B.7 Human Studies B.7.1 Emotion MOS Evaluation For the human evaluation study, we recruit 20 participants through Prolific [36] and compensate them at a competitive hourly rate. All recruited participants are na...
[54]

Hugging Face model-card terms; all- MiniLM-L6-v2 is Apache-2.0

and all-MiniLM- L6-v2 [46, 47] Text emotion classifier and query deduplication embed- dings. Hugging Face model-card terms; all- MiniLM-L6-v2 is Apache-2.0. UTMOS [ 40], BERTScore [ 37], ROUGE-L [38], LoRA [25], and DeepSpeed [42] Evaluation metrics and train- ing infrastructure. UTMOS and BERTScore are MIT Li- cense; ROUGE-score, LoRA/PEFT, and DeepSpeed...

[1] [1]

Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. InAdvances in Neural Information Processing Systems, 2020

2020

[2] [2]

Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. InAdvances in Neural Information Processing Systems, 2022

2022

[3] [3]

Towards empathetic open-domain conversation models: A new benchmark and dataset

Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. Towards empathetic open-domain conversation models: A new benchmark and dataset. InProceedings of ACL, 2019

2019

[4] [4]

Klaus R. Scherer. V ocal communication of emotion: A review of research paradigms.Speech Communication, 40(1–2):227–256, 2003

2003

[5] [5]

Roisman, and Thomas S

Zhihong Zeng, Maja Pantic, Glenn I. Roisman, and Thomas S. Huang. A survey of affect recognition methods: Audio, visual, and spontaneous expressions.IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(1):39–58, 2009

2009

[6] [6]

James A. Russell. A circumplex model of affect.Journal of Personality and Social Psychology, 39(6):1161–1178, 1980

1980

[7] [7]

Chang, Sungbok Lee, and Shrikanth S

Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N. Chang, Sungbok Lee, and Shrikanth S. Narayanan. IEMOCAP: Interactive emotional dyadic motion capture database.Language Resources and Evaluation, 42(4):335– 359, 2008

2008

[8] [8]

MELD: A multimodal multi-party dataset for emotion recognition in conversa- tions

Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. MELD: A multimodal multi-party dataset for emotion recognition in conversa- tions. InProceedings of ACL, 2019

2019

[9] [9]

MoEL: Mixture of empathetic listeners

Zhaojiang Lin, Andrea Madotto, Jamin Shin, Peng Xu, and Pascale Fung. MoEL: Mixture of empathetic listeners. InProceedings of EMNLP-IJCNLP, 2019

2019

[10] [10]

MIME: MIMicking emotions for empathetic response generation

Navonil Majumder, Pengfei Hong, Shanshan Peng, Jiankun Lu, Deepanway Ghosal, Alexander Gelbukh, Rada Mihalcea, and Soujanya Poria. MIME: MIMicking emotions for empathetic response generation. InProceedings of EMNLP, 2020

2020

[11] [11]

Reza Lotfian and Carlos Busso. Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings.IEEE Transactions on Affective Computing, 10(4):471–483, 2019. doi: 10.1109/TAFFC.2017.2736999. 10

work page doi:10.1109/taffc.2017.2736999 2019

[12] [12]

Salman, Wei-Cheng Lin, Lucas Goncalves, Srinivas Parthasarathy, Abinay Reddy Naini, Seong-Gyun Leem, Luz Martinez-Lucas, Huang- Cheng Chou, and Pravin Mote

Carlos Busso, Reza Lotfian, Kusha Sridhar, Ali N. Salman, Wei-Cheng Lin, Lucas Goncalves, Srinivas Parthasarathy, Abinay Reddy Naini, Seong-Gyun Leem, Luz Martinez-Lucas, Huang- Cheng Chou, and Pravin Mote. The MSP-Podcast corpus.arXiv preprint arXiv:2509.09791, 2025

work page arXiv 2025

[13] [13]

Ali Mollahosseini, Behzad Hasani, and Mohammad H. Mahoor. AffectNet: A database for facial expression, valence, and arousal computing in the wild.IEEE Transactions on Affective Computing, 10(1):18–31, 2019

2019

[14] [14]

DEAP: A database for emotion analysis using physiological signals.IEEE Transactions on Affective Computing, 3(1): 18–31, 2012

Sander Koelstra, Christian Mühl, Mohammad Soleymani, Jong-Seok Lee, Ashkan Yazdani, Touradj Ebrahimi, Thierry Pun, Anton Nijholt, and Ioannis Patras. DEAP: A database for emotion analysis using physiological signals.IEEE Transactions on Affective Computing, 3(1): 18–31, 2012

2012

[15] [15]

SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities,

Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities.arXiv preprint arXiv:2305.11000, 2023

work page arXiv 2023

[16] [16]

Moshi: a speech-text foundation model for real-time dialogue

Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. Moshi: A speech-text foundation model for real-time dialogue.arXiv preprint arXiv:2410.00037, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[17] [17]

GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot

Aohan Zeng, Zhengxiao Du, Mingdao Liu, Kedong Wang, Shengmin Jiang, Lei Zhao, Yuxiao Dong, and Jie Tang. GLM-4-V oice: Towards intelligent and human-like end-to-end spoken chatbot.arXiv preprint arXiv:2412.02612, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[18] [18]

Qwen3-Omni Technical Report

Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, et al. Qwen3-Omni technical report.arXiv preprint arXiv:2509.17765, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[19] [19]

Kimi-Audio Technical Report

KimiTeam, Ding Ding, Zeqian Ju, Yichong Leng, Songxiang Liu, Tong Liu, Zeyu Shang, Kai Shen, Wei Song, Xu Tan, et al. Kimi-Audio technical report.arXiv preprint arXiv:2504.18425, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[20] [20]

BLSP-Emo: Towards empathetic large speech-language models

Chen Wang, Minpeng Liao, Zhongqiang Huang, Junhong Wu, Chengqing Zong, and Jiajun Zhang. BLSP-Emo: Towards empathetic large speech-language models. InProceedings of EMNLP, 2024

2024

[21] [21]

OpenS2S: Advancing fully open- source end-to-end empathetic large speech language model.arXiv preprint arXiv:2507.05177, 2025

Chen Wang, Tianyu Peng, Wen Yang, Yinan Bai, Guangfu Wang, Jun Lin, Lanpeng Jia, Lingxi- ang Wu, Jinqiao Wang, Chengqing Zong, and Jiajun Zhang. OpenS2S: Advancing fully open- source end-to-end empathetic large speech language model.arXiv preprint arXiv:2507.05177, 2025

work page arXiv 2025

[22] [22]

Empathy Omni: Enabling empathetic speech response generation through large language models.arXiv preprint arXiv:2508.18655, 2025

Haoyu Wang, Guangyan Zhang, Jiale Chen, Jingyu Li, Yuehai Wang, and Yiwen Guo. Empathy Omni: Enabling empathetic speech response generation through large language models.arXiv preprint arXiv:2508.18655, 2025

work page arXiv 2025

[23] [23]

OSUM-EChat: Enhancing end-to-end empathetic spoken chatbot via understanding-driven spoken dialogue.arXiv preprint arXiv:2508.09600, 2025

Xuelong Geng, Qijie Shao, Hongfei Xue, Shuiyuan Wang, Hanke Xie, Zhao Guo, Yi Zhao, Guojian Li, Wenjie Tian, Chengyou Wang, Zhixian Zhao, Kangxiang Xia, Ziyu Zhang, Zhennan Lin, Tianlun Zuo, Mingchen Shao, Yuang Cao, Guobin Ma, Longhao Li, Yuhang Dai, Dehui Gao, Dake Guo, and Lei Xie. OSUM-EChat: Enhancing end-to-end empathetic spoken chatbot via understa...

work page arXiv 2025

[24] [24]

Robust Speech Recognition via Large-Scale Weak Supervision

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision.arXiv preprint arXiv:2212.04356, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[25] [25]

Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In Proceedings of ICLR, 2022. 11

2022

[26] [26]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[27] [27]

Qwen3-TTS Technical Report

Hangrui Hu, Xinfa Zhu, Ting He, Dake Guo, Bin Zhang, Xiong Wang, Zhifang Guo, Ziyue Jiang, Hongkun Hao, Zishan Guo, et al. Qwen3-TTS technical report.arXiv preprint arXiv:2601.15621, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[28] [28]

Savchenko

Andrey V . Savchenko. HSEmotion: High-speed emotion recognition library.Software Impacts, 14:100433, 2022

2022

[29] [29]

SEED-VII: A multimodal dataset of six basic emotions with continuous labels for emotion recognition.IEEE Transactions on Affective Computing, 16(2):969–985, 2025

Wei-Bang Jiang, Xuan-Hao Liu, Wei-Long Zheng, and Bao-Liang Lu. SEED-VII: A multimodal dataset of six basic emotions with continuous labels for emotion recognition.IEEE Transactions on Affective Computing, 16(2):969–985, 2025

2025

[30] [30]

YAAD: Young adult’s affective data using wearable ECG and GSR sensors

Muhammad Najam Dar, Amna Rahim, Muhammad Usman Akram, Sajid Gul Khawaja, and Aqsa Rahim. YAAD: Young adult’s affective data using wearable ECG and GSR sensors. In2022 2nd International Conference on Digital Futures and Transformative Technologies (ICoDT2), pages 1–7. IEEE, 2022

2022

[31] [31]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of CVPR, 2016

2016

[32] [32]

Scherer and Harald G

Klaus R. Scherer and Harald G. Wallbott. Evidence for universality and cultural variation of differential emotion response patterning.Journal of Personality and Social Psychology, 66(2): 310–328, 1994

1994

[33] [33]

Emotion English DistilRoBERTa-base

Jochen Hartmann. Emotion English DistilRoBERTa-base. https://huggingface.co/ j-hartmann/emotion-english-distilroberta-base/, 2022

2022

[34] [34]

URO-Bench: Towards comprehensive evaluation for end-to-end spoken dialogue models

Ruiqi Yan, Xiquan Li, Wenxi Chen, Zhikang Niu, Chen Yang, Ziyang Ma, Kai Yu, and Xie Chen. URO-Bench: Towards comprehensive evaluation for end-to-end spoken dialogue models. InFindings of the Association for Computational Linguistics: EMNLP, 2025

2025

[35] [35]

VoiceBench: Benchmarking LLM-Based Voice Assistants

Yiming Chen, Xianghu Yue, Chen Zhang, Xiaoxue Gao, Robby T. Tan, and Haizhou Li. V oiceBench: Benchmarking LLM-based voice assistants.arXiv preprint arXiv:2410.17196, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[36] [36]

Prolific.ac–a subject pool for online experiments.Journal of Behavioral and Experimental Finance, 17:22–27, 2018

Stefan Palan and Christian Schitter. Prolific.ac–a subject pool for online experiments.Journal of Behavioral and Experimental Finance, 17:22–27, 2018. doi: 10.1016/j.jbef.2017.12.004

work page doi:10.1016/j.jbef.2017.12.004 2018

[37] [37]

Weinberger, and Yoav Artzi

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. InProceedings of ICLR, 2020

2020

[38] [38]

ROUGE: A package for automatic evaluation of summaries

Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. InText Summariza- tion Branches Out, pages 74–81, 2004

2004

[39] [39]

Mini-Omni: Language models can hear, talk while thinking in streaming.arXiv preprint arXiv:2408.16725, 2024

Zhifei Xie and Changqiao Wu. Mini-Omni: Language models can hear, talk while thinking in streaming.arXiv preprint arXiv:2408.16725, 2024

work page arXiv 2024

[40] [40]

UTMOS: UTokyo-SaruLab system for V oiceMOS challenge 2022

Takaaki Saeki, Detai Xin, Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, and Hiroshi Saruwatari. UTMOS: UTokyo-SaruLab system for V oiceMOS challenge 2022. In Proceedings of Interspeech, pages 4521–4525, 2022

2022

[41] [41]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InProceedings of ICLR, 2019

2019

[42] [42]

DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters

Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of KDD, 2020

2020

[43] [43]

A dictionary of affect in language: I

Kevin Sweeney and Cynthia Whissell. A dictionary of affect in language: I. establishment and preliminary validation.Perceptual and Motor Skills, 59(3):695–698, 1984. doi: 10.2466/pms. 1984.59.3.695. 12

work page doi:10.2466/pms 1984

[44] [44]

Harper & Row, New York, 1980

Robert Plutchik.Emotion: A Psychoevolutionary Synthesis. Harper & Row, New York, 1980

1980

[45] [45]

Roddy Cowie, Ellen Douglas-Cowie, Nicolas Tsapatsoulis, George V otsis, Stefanos Kollias, Winfried Fellenz, and John G. Taylor. Emotion recognition in human-computer interaction. IEEE Signal Processing Magazine, 18(1):32–80, 2001. doi: 10.1109/79.911197

work page doi:10.1109/79.911197 2001

[46] [46]

Sentence-bert: Sentence embeddings using siamese bert-networks

Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT- networks. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 3982–3992, 2019. doi: 10.18653/v1/D19-1410

work page doi:10.18653/v1/d19-1410 2019

[47] [47]

MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers

Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. InAdvances in Neural Information Processing Systems, volume 33, pages 5776–5788, 2020

2020

[48] [48]

Harris, K

Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fer- nández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin She...

2020

[49] [49]

Qwen2.5-Omni Technical Report

Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. Qwen2.5-Omni technical report.arXiv preprint arXiv:2503.20215, 2025. A Societal Impact and Responsible Deployment SYMPATHEIAis intended to make spoken assistants more emotionally aware, supporti...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[50] [50]

Acknowledges and addresses the user’s emotional state explicitly and warmly

[51] [51]

Answers their actual question or addresses their request with real, useful content

[52] [52]

Weaves the emotional support and the topic together

[53] [53]

I don’t care how fancy this city looks on postcards — what’s the point of even trying to explore it if everything’s closed and no one knows where anything is?

A listener could tell WHAT EMOTION you’re responding to AND what the user asked about Output ONLY the response text -- no quotes, no explanation. B.7 Human Studies B.7.1 Emotion MOS Evaluation For the human evaluation study, we recruit 20 participants through Prolific [36] and compensate them at a competitive hourly rate. All recruited participants are na...

[54] [54]

Hugging Face model-card terms; all- MiniLM-L6-v2 is Apache-2.0

and all-MiniLM- L6-v2 [46, 47] Text emotion classifier and query deduplication embed- dings. Hugging Face model-card terms; all- MiniLM-L6-v2 is Apache-2.0. UTMOS [ 40], BERTScore [ 37], ROUGE-L [38], LoRA [25], and DeepSpeed [42] Evaluation metrics and train- ing infrastructure. UTMOS and BERTScore are MIT Li- cense; ROUGE-score, LoRA/PEFT, and DeepSpeed...