arxiv: 2309.07586 · v1 · pith:X6L6GXXPnew · submitted 2023-09-14 · 📡 eess.AS · cs.SD

Emo-StarGAN: A Semi-Supervised Any-to-Many Non-Parallel Emotion-Preserving Voice Conversion

classification 📡 eess.AS cs.SD
keywords emotion · anonymisation · starganv2-vc · any-to-many · conversion · data · non-parallel · preservation

Speech anonymisation prevents misuse of spoken data by removing any personal identifier while preserving at least linguistic content. However, emotion preservation is crucial for natural human-computer interaction. The well-known voice conversion technique StarGANv2-VC achieves anonymisation but fails to preserve emotion. This work presents an any-to-many semi-supervised StarGANv2-VC variant trained on partially emotion-labelled non-parallel data. We propose emotion-aware losses computed on the emotion embeddings and acoustic features correlated to emotion. Additionally, we use an emotion classifier to provide direct emotion supervision. Objective and subjective evaluations show that the proposed approach significantly improves emotion preservation over the vanilla StarGANv2-VC. This considerable improvement is seen over diverse datasets, emotions, target speakers, and inter-group conversions without compromising intelligibility and anonymisation.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DDPO-VC: Speaker De-Identification via Diffusion Denoising Policy Optimization

    eess.AS 2026-06 unverdicted novelty 6.0

    DDPO-VC applies diffusion denoising policy optimization with dual-teacher rewards to improve speaker de-identification while preserving cognitive utility on dementia speech benchmarks.