pith. machine review for the scientific record

arxiv: 1810.00108 · v1 · submitted 2018-09-28 · 💻 cs.CV

Recognition: unknown

Audio-Visual Speech Recognition With A Hybrid CTC/Attention Architecture

Authors on Pith: no claims yet
classification 💻 cs.CV
keywords: audio-visual, model, recognition, architecture, speech, error, hybrid, rate
0 comments
read the original abstract

Recent works in speech recognition rely either on connectionist temporal classification (CTC) or sequence-to-sequence models for character-level recognition. CTC assumes conditional independence of individual characters, whereas attention-based models can produce nonsequential alignments. Therefore, a CTC loss can be used in combination with an attention-based model in order to force monotonic alignments while removing the conditional independence assumption. In this paper, we use the recently proposed hybrid CTC/attention architecture for audio-visual recognition of speech in-the-wild. To the best of our knowledge, this is the first time that such a hybrid architecture is used for audio-visual speech recognition. We use the LRS2 database and show that the proposed audio-visual model leads to a 1.3% absolute decrease in word error rate over the audio-only model and achieves new state-of-the-art performance on the LRS2 database (7% word error rate). We also observe that the audio-visual model significantly outperforms the audio-only model (up to 32.9% absolute improvement in word error rate) for several different types of noise as the signal-to-noise ratio decreases.
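The hybrid objective described in the abstract is typically an interpolation of the two losses: CTC contributes monotonic-alignment pressure, while the attention decoder avoids CTC's conditional-independence assumption. A minimal sketch of that combination, where the weight `lam` is a hypothetical hyperparameter (the abstract does not state the paper's actual value or loss names):

```python
def hybrid_loss(ctc_loss: float, attention_loss: float, lam: float = 0.2) -> float:
    """Interpolated CTC/attention training objective.

    Combines a CTC loss (enforces monotonic input-output alignment) with an
    attention-based sequence-to-sequence loss (no conditional-independence
    assumption) as: lam * L_ctc + (1 - lam) * L_att.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * ctc_loss + (1.0 - lam) * attention_loss


# Example: lam = 0.2, CTC loss 2.0, attention loss 1.0
# combine to 0.2 * 2.0 + 0.8 * 1.0 = 1.2
combined = hybrid_loss(2.0, 1.0, lam=0.2)
```

In practice the two losses would come from a shared encoder feeding both a CTC head and an attention decoder, with the same interpolation often reused to rescore hypotheses at decoding time.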

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Cross-Modal-Domain Generalization Through Semantically Aligned Discrete Representations

    cs.CV 2026-05 unverdicted novelty 7.0

    CoDAAR creates a unified discrete representation space for multimodal sequences by aligning modality-specific codebooks through index-level semantic consensus, enabling both specificity and cross-modal generalization.

  2. Cross-Modal-Domain Generalization Through Semantically Aligned Discrete Representations

    cs.CV 2026-05 unverdicted novelty 7.0

    CoDAAR aligns modality-specific codebooks at the index level using Discrete Temporal Alignment and Cascading Semantic Alignment to achieve cross-modal generalization while preserving unique structures, reporting state...