Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM

2017 · cs.CL · arXiv 1706.02737

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

We present a state-of-the-art end-to-end Automatic Speech Recognition (ASR) model. We learn to listen and write characters with a joint Connectionist Temporal Classification (CTC) and attention-based encoder-decoder network. The encoder is a deep Convolutional Neural Network (CNN) based on the VGG network. The CTC network sits on top of the encoder and is jointly trained with the attention-based decoder. During the beam search process, we combine the CTC predictions, the attention-based decoder predictions and a separately trained LSTM language model. We achieve a 5-10% error reduction compared to prior systems on spontaneous Japanese and Chinese speech, and our end-to-end model beats out traditional hybrid ASR systems.
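The beam-search combination described in the abstract can be illustrated with a minimal sketch. It assumes a log-linear interpolation of the three scores, which the abstract itself does not spell out. The function names, the dictionary keys, and the weights `ctc_weight` and `lm_weight` are hypothetical choices for illustration, not values taken from the paper.

```python
def joint_score(logp_ctc, logp_att, logp_lm, ctc_weight=0.3, lm_weight=0.1):
    """Combine per-hypothesis log-probabilities from the three models.

    Assumption: CTC and attention scores are interpolated with ctc_weight,
    and the language-model score is added with its own weight lm_weight.
    """
    return (ctc_weight * logp_ctc
            + (1.0 - ctc_weight) * logp_att
            + lm_weight * logp_lm)


def rank_hypotheses(hyps, ctc_weight=0.3, lm_weight=0.1):
    """Sort candidate hypotheses from best to worst by combined score.

    Each hypothesis is a dict with hypothetical keys "text", "ctc", "att"
    and "lm", where the last three hold log-probabilities.
    """
    return sorted(
        hyps,
        key=lambda h: joint_score(h["ctc"], h["att"], h["lm"],
                                  ctc_weight, lm_weight),
        reverse=True,
    )


# Two made-up candidates: "b" has the better attention score, but "a"
# wins once the CTC and language-model scores are taken into account.
candidates = [
    {"text": "b", "ctc": -5.0, "att": -1.5, "lm": -1.0},
    {"text": "a", "ctc": -1.0, "att": -2.0, "lm": -3.0},
]
ranked = rank_hypotheses(candidates)
```

In a real decoder this scoring would be applied to partial hypotheses at every beam-search step rather than once to complete hypotheses. The sketch only shows how the three scores are combined.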

representative citing papers

Cross-Attention End-to-End ASR for Two-Party Conversations

eess.AS · 2019-07-24 · unverdicted · novelty 6.0

End-to-end ASR model with speaker-specific cross-attention for two-party conversations outperforms standard models on the Switchboard corpus.

Self Multi-Head Attention for Speaker Recognition

cs.SD · 2019-06-24 · unverdicted · novelty 6.0

Self multi-head attention applied after CNN encoding of spectrograms outperforms temporal and statistical pooling for speaker verification on VoxCeleb1 with 18% relative EER reduction.

citing papers explorer

Showing 2 of 2 citing papers.

Cross-Attention End-to-End ASR for Two-Party Conversations

eess.AS · 2019-07-24 · unverdicted · none · ref 45 · internal anchor

Self Multi-Head Attention for Speaker Recognition

cs.SD · 2019-06-24 · unverdicted · none · ref 30 · internal anchor
