pith. machine review for the scientific record.

arxiv: 1809.00496 · v2 · submitted 2018-09-03 · 💻 cs.CV

Recognition: unknown

LRS3-TED: a large-scale dataset for visual speech recognition

keywords: dataset, recognition, speech, visual, alignment, along, audio-visual, available

This paper introduces a new multi-modal dataset for visual and audio-visual speech recognition. It includes face tracks from over 400 hours of TED and TEDx videos, along with the corresponding subtitles and word alignment boundaries. The new dataset is substantially larger in scale compared to other public datasets that are available for general research.

This paper has not been read by Pith yet.


Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Hierarchical Codec Diffusion for Video-to-Speech Generation

    cs.SD 2026-04 unverdicted novelty 7.0

    HiCoDiT generates speech from video by conditioning low-level RVQ tokens on speaker identity and high-level tokens on facial expressions via a dual-scale normalized diffusion transformer.

  2. CoSyncDiT: Cognitive Synchronous Diffusion Transformer for Movie Dubbing

    cs.SD 2026-04 unverdicted novelty 7.0

    CoSyncDiT is a cognitive-inspired diffusion transformer that achieves state-of-the-art lip synchronization and naturalness in movie dubbing by guiding noise-to-speech generation through acoustic, visual, and contextua...

  3. OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text

    cs.SD 2026-04 unverdicted novelty 7.0

    OmniSonic introduces a TriAttn-DiT architecture with MoE gating to jointly generate on-screen, off-screen, and speech audio from video and text, outperforming prior models on a new UniHAGen-Bench.

  4. LRS-VoxMM: A benchmark for in-the-wild audio-visual speech recognition

    eess.AS 2026-04 unverdicted novelty 6.0

    LRS-VoxMM is a new in-the-wild AVSR benchmark that is harder than LRS3 and demonstrates increasing value of visual information under acoustic degradation.

  5. Tracking Listener Attention: Gaze-Guided Audio-Visual Speech Enhancement Framework

    eess.AS 2026-04 unverdicted novelty 6.0

    The GG-AVSE framework uses listener gaze direction combined with YOLO5Face and AVSEMamba to resolve target-speaker ambiguity in audio-visual speech enhancement, yielding gains in PESQ, STOI, and SI-SDR.

  6. Delayed Commitment for Representation Readiness in Stage-wise Audio-Visual Learning

    cs.SD 2026-05 unverdicted novelty 5.0

    DPC-Net improves stage-wise audio-visual learning by correcting readiness deficiencies in fused representations using cross-layer and cross-modal evidence.

  7. BUT System Description for CHiME-9 MCoRec Challenge

    eess.AS 2026-04 unverdicted novelty 3.0

    BUT's CHiME-9 MCoRec system conditions Parakeet-v2 ASR on AV-HuBERT visuals for 33.7% WER and uses Qwen3.5 LLM for hierarchical clustering to reach 0.97 F1, beating the baseline by 16.2% WER and 0.15 F1 on the develop...