Attention-Based Models for Text-Dependent Speaker Verification
Attention-based models have recently shown great performance on a range of tasks, such as speech recognition, machine translation, and image captioning, due to their ability to summarize relevant information that spans the entire length of an input sequence. In this paper, we analyze the use of attention mechanisms for the problem of sequence summarization in our end-to-end text-dependent speaker recognition system. We explore different topologies of the attention layer and their variants, and compare different pooling methods on the attention weights. Ultimately, we show that attention-based models can improve the Equal Error Rate (EER) of our speaker verification system by a relative 14% compared to our non-attention LSTM baseline model.
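As a rough illustration of what "sequence summarization" by attention means here, the sketch below collapses a sequence of frame-level outputs into one fixed-size vector using softmax-normalized weights. It is a minimal sketch under stated assumptions, not the paper's architecture: the linear scoring function, the function name `attention_pool`, and the toy inputs are all hypothetical.

```python
import math

def attention_pool(frames, w, b=0.0):
    """Summarize a list of frame vectors into one vector.

    Assumption: each frame gets a scalar score from a linear function
    (dot(frame, w) + b); the paper compares several scoring topologies.
    """
    # One scalar score per frame.
    scores = [sum(x * wi for x, wi in zip(f, w)) + b for f in frames]
    # Softmax over time, shifted by the max score for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]  # attention weights, sum to 1
    # Weighted average of the frames, one output value per dimension.
    dim = len(frames[0])
    pooled = [sum(a * f[d] for a, f in zip(alpha, frames)) for d in range(dim)]
    return pooled, alpha

# Toy input (made up): three 2-dimensional frames.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# With all-zero scoring weights every frame gets the same weight,
# so the result reduces to plain averaging over frames.
pooled, alpha = attention_pool(frames, w=[0.0, 0.0])
```

With non-zero `w`, frames that score higher contribute more to the summary vector, which is the behavior that distinguishes attention from uniform averaging.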
Forward citations
Cited by 2 Pith papers
- Speaker Recognition with Random Digit Strings Using Uncertainty Normalized HMM-based i-vectors
  Digit-specific HMM i-vectors with uncertainty normalization reach 1.52% male and 1.77% female EER on RSR2015 part III using only that corpus and simple cosine scoring.
- Self Multi-Head Attention for Speaker Recognition
  Self multi-head attention applied after CNN encoding of spectrograms outperforms temporal and statistical pooling for speaker verification on VoxCeleb1 with 18% relative EER reduction.