Cotatron: Transcription-Guided Speech Encoder for Any-to-Many Voice Conversion without Parallel Data

Doo-young Kim; Myun-chul Joe; Seung-won Park

arxiv: 2005.03295 · v2 · pith:ORVHXAP2new · submitted 2020-05-07 · eess.AS · cs.LG · cs.SD

Seung-won Park, Doo-young Kim, Myun-chul Joe

keywords: cotatron, speech, system, available, conversion, encoder, previous, speakers

Abstract

We propose Cotatron, a transcription-guided speech encoder for speaker-independent linguistic representation. Cotatron is based on the multispeaker TTS architecture and can be trained with conventional TTS datasets. We train a voice conversion system to reconstruct speech with Cotatron features, which is similar to the previous methods based on Phonetic Posteriorgram (PPG). By training and evaluating our system with 108 speakers from the VCTK dataset, we outperform the previous method in terms of both naturalness and speaker similarity. Our system can also convert speech from speakers that are unseen during training, and utilize ASR to automate the transcription with minimal reduction of the performance. Audio samples are available at https://mindslab-ai.github.io/cotatron, and the code with a pre-trained model will be made available soon.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Linguistically Augmented Audio Speech Data (LinguAS)
cs.SD · 2026-06 · unverdicted · novelty 7.0

Introduces the LinguAS dataset of genuine and deepfaked audio annotated with expert-defined linguistic features to improve detection model performance over ASVspoof 2021 and SSL baselines.