Deep context: end-to-end contextual speech recognition
2 Pith papers cite this work. Polarity classification is still indexing.
abstract
In automatic speech recognition (ASR) what a user says depends on the particular context she is in. Typically, this context is represented as a set of word n-grams. In this work, we present a novel, all-neural, end-to-end (E2E) ASR system that utilizes such context. Our approach, which we refer to as Contextual Listen, Attend and Spell (CLAS) jointly optimizes the ASR components along with embeddings of the context n-grams. During inference, the CLAS system can be presented with context phrases which might contain out-of-vocabulary (OOV) terms not seen during training. We compare our proposed system to a more traditional contextualization approach, which performs shallow-fusion between independently trained LAS and contextual n-gram models during beam search. Across a number of tasks, we find that the proposed CLAS system outperforms the baseline method by as much as 68% relative WER, indicating the advantage of joint optimization over individually trained components.
Index Terms: speech recognition, sequence-to-sequence models, listen attend and spell, LAS, attention, embedded speech recognition.
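The shallow-fusion baseline mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common log-linear form, in which the ASR log-probability is added to a weighted log-probability from the contextual model. For brevity it rescores complete hypotheses instead of interpolating at each beam-search step, and the function names, the weight, and all probabilities are hypothetical values chosen for illustration, not figures from the paper.

```python
import math


def shallow_fusion_score(asr_logprob, context_logprob, weight=0.5):
    """Log-linear interpolation: ASR score plus weighted context-model score.

    `weight` is a hypothetical tuning parameter, not a value from the paper.
    """
    return asr_logprob + weight * context_logprob


def rerank(hypotheses, weight=0.5):
    """Sort (text, asr_logprob, context_logprob) tuples, best fused score first."""
    return sorted(
        hypotheses,
        key=lambda h: shallow_fusion_score(h[1], h[2], weight),
        reverse=True,
    )


# Made-up scores: the ASR model alone prefers "call joan", while the
# contextual model (e.g. built from a contact list) prefers "call john".
hyps = [
    ("call joan", math.log(0.6), math.log(0.1)),
    ("call john", math.log(0.4), math.log(0.9)),
]

print(rerank(hyps, weight=0.0)[0][0])  # context model ignored
print(rerank(hyps, weight=0.5)[0][0])  # context model applied
```

In this sketch the two models are trained separately and only combined at scoring time, which is the property the abstract contrasts with the joint optimization used in CLAS.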
years
2019 2
verdicts
UNVERDICTED 2
representative citing papers
An E2E ASR model with mixed wordpieces and phonemes improves foreign proper noun recognition via phoneme-level contextual biasing, showing 16% gain over grapheme-only and 8% over wordpiece-only baselines.
citing papers explorer
- Cross-Attention End-to-End ASR for Two-Party Conversations
  End-to-end ASR model with speaker-specific cross-attention for two-party conversations outperforms standard models on the Switchboard corpus.
- Phoneme-Based Contextualization for Cross-Lingual Speech Recognition in End-to-End Models
  An E2E ASR model with mixed wordpieces and phonemes improves foreign proper noun recognition via phoneme-level contextual biasing, showing 16% gain over grapheme-only and 8% over wordpiece-only baselines.