End-to-End Speech Recognition From the Raw Waveform

Emmanuel Dupoux; Gabriel Synnaeve; Neil Zeghidour; Nicolas Usunier; Ronan Collobert

arxiv: 1806.07098 · v2 · pith:NGGTUPEMnew · submitted 2018-06-19 · cs.CL · cs.SD · eess.AS

Neil Zeghidour, Nicolas Usunier, Gabriel Synnaeve, Ronan Collobert, Emmanuel Dupoux

keywords: filterbanks, mel-filterbanks, trainable, end-to-end, first, waveform, approaches, modifications

State-of-the-art speech recognition systems rely on fixed, hand-crafted features such as mel-filterbanks to preprocess the waveform before the training pipeline. In this paper, we study end-to-end systems trained directly from the raw waveform, building on two alternatives for trainable replacements of mel-filterbanks that use a convolutional architecture. The first is inspired by gammatone filterbanks (Hoshen et al., 2015; Sainath et al., 2015), and the second by the scattering transform (Zeghidour et al., 2017). We propose two modifications to these architectures and systematically compare them to mel-filterbanks on the Wall Street Journal dataset. The first modification is the addition of an instance normalization layer, which greatly improves the gammatone-based trainable filterbanks and speeds up the training of the scattering-based filterbanks. The second relates to the low-pass filter used in these approaches. These modifications consistently improve performance for both approaches and remove the need for careful initialization in scattering-based trainable filterbanks. In particular, we show a consistent improvement in word error rate of the trainable filterbanks relative to comparable mel-filterbanks. This is the first time end-to-end models trained from the raw signal significantly outperform mel-filterbanks on a large-vocabulary task under clean recording conditions.
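
As a rough illustration of the instance normalization step mentioned in the abstract, the sketch below normalizes each channel of a single utterance to zero mean and unit variance over time. It assumes the features are laid out as (channels, frames); the function name, the layout, the value of eps, and the random input are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def instance_norm(features, eps=1e-5):
    """Normalize each channel of one utterance over its time axis.

    features: array of shape (channels, frames) (assumed layout).
    eps: small constant for numerical stability (assumed value).
    """
    mean = features.mean(axis=1, keepdims=True)  # per-channel mean over time
    var = features.var(axis=1, keepdims=True)    # per-channel variance over time
    return (features - mean) / np.sqrt(var + eps)

# Hypothetical input: 40 filterbank channels, 500 frames of synthetic data.
rng = np.random.default_rng(0)
feats = rng.normal(loc=3.0, scale=2.0, size=(40, 500))
normed = instance_norm(feats)
```

Because the statistics are computed per utterance and per channel, the output does not depend on the overall gain or offset of each channel in the input.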

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

End-to-End ASR for Code-switched Hindi-English Speech
eess.AS · 2019-06 · unverdicted · novelty 4.0

End-to-end ASR for code-switched Hindi-English with <50 hours of data shows gains from multi-task learning and corpus balancing but underperforms cascaded baselines.