SpEx+: A Complete Time Domain Speaker Extraction Network

Chenglin Xu; Eng Siong Chng; Haizhou Li; Jianwu Dang; Longbiao Wang; Meng Ge

arXiv: 2005.04686 · v2 · pith:MQ55BJAUnew · submitted 2020-05-10 · eess.AS · cs.SD

keywords: speaker, spex, time-domain, extraction, speech, frequency-domain, solution, complete

Abstract

Speaker extraction aims to extract the target speech signal from a multi-talker environment given a target speaker's reference speech. We recently proposed a time-domain solution, SpEx, that avoids the phase estimation required by frequency-domain approaches. However, SpEx is not a fully time-domain solution: it performs time-domain speech encoding for speaker extraction but takes a frequency-domain speaker embedding as the reference. The analysis window size for the time-domain input also differs from that for the frequency-domain input. This mismatch has an adverse effect on system performance. To eliminate it, we propose a complete time-domain speaker extraction solution called SpEx+. Specifically, we tie the weights of two identical speech encoder networks, one for the encoder-extractor-decoder pipeline and the other as part of the speaker encoder. Experiments show that SpEx+ achieves 0.8 dB and 2.1 dB SDR improvement over the state-of-the-art SpEx baseline under different-gender and same-gender conditions, respectively, on the WSJ0-2mix-extr database.
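The weight tying described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class `SpeechEncoder`, the function `speaker_embedding`, the filter count, the window size, the half-window stride, and the mean pooling over time are all hypothetical choices made for illustration. The only point shown is that the mixture path and the reference path use one and the same set of encoder weights, so both signals are mapped into the same latent space with the same analysis window.

```python
import math
import random


class SpeechEncoder:
    """Toy 1-D convolutional encoder (hypothetical, for illustration only).

    Frames a waveform with a fixed window and stride, then applies a bank
    of linear filters followed by a ReLU.
    """

    def __init__(self, n_filters, window, seed=0):
        rng = random.Random(seed)
        self.window = window
        self.stride = window // 2  # assumed 50% overlap
        self.filters = [
            [rng.uniform(-1.0, 1.0) for _ in range(window)]
            for _ in range(n_filters)
        ]

    def __call__(self, wave):
        # Slice the waveform into overlapping frames.
        starts = range(0, len(wave) - self.window + 1, self.stride)
        frames = [wave[s:s + self.window] for s in starts]
        # One feature per filter per frame: ReLU(dot(filter, frame)).
        return [
            [max(0.0, sum(w * x for w, x in zip(f, frame))) for f in self.filters]
            for frame in frames
        ]


def speaker_embedding(encoder, reference):
    """Assumed speaker encoder: time-domain encoding, then mean pooling."""
    feats = encoder(reference)
    return [sum(col) / len(feats) for col in zip(*feats)]


# Tie the weights: both paths hold a reference to the same encoder object.
shared = SpeechEncoder(n_filters=4, window=8)
mixture_encoder = shared      # used by the encoder-extractor-decoder pipeline
reference_encoder = shared    # used inside the speaker encoder

# Synthetic signals standing in for a mixture and a reference utterance.
mixture = [math.sin(0.3 * t) + math.sin(0.7 * t) for t in range(32)]
reference = [math.sin(0.3 * t) for t in range(48)]

mix_feats = mixture_encoder(mixture)
embedding = speaker_embedding(reference_encoder, reference)
```

Because the two names refer to one object, any update to the filters is seen by both paths, and the analysis window is identical by construction. In a deep learning framework the same effect would typically be obtained by reusing a single module instance in both places.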

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

pTSE-T: Presentation Target Speaker Extraction using Unaligned Text Cues
cs.SD 2024-11 unverdicted novelty 5.0

pTSE-T conditions TSE on unaligned text semantic cues via a TPE network for mask generation, reporting an SI-SDRi of 12.16 dB.