LRS3-TED: a large-scale dataset for visual speech recognition

Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman

classification cs.CV

keywords dataset, recognition, speech, visual, alignment, along, audio-visual, available

Abstract

This paper introduces a new multi-modal dataset for visual and audio-visual speech recognition. It includes face tracks from over 400 hours of TED and TEDx videos, along with the corresponding subtitles and word alignment boundaries. The new dataset is substantially larger in scale compared to other public datasets that are available for general research.

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Hierarchical Codec Diffusion for Video-to-Speech Generation
cs.SD 2026-04 unverdicted novelty 7.0

HiCoDiT generates speech from video by conditioning low-level RVQ tokens on speaker identity and high-level tokens on facial expressions via a dual-scale normalized diffusion transformer.

CoSyncDiT: Cognitive Synchronous Diffusion Transformer for Movie Dubbing
cs.SD 2026-04 unverdicted novelty 7.0

CoSyncDiT is a cognitive-inspired diffusion transformer that achieves state-of-the-art lip synchronization and naturalness in movie dubbing by guiding noise-to-speech generation through acoustic, visual, and contextua...

OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text
cs.SD 2026-04 unverdicted novelty 7.0

OmniSonic introduces a TriAttn-DiT architecture with MoE gating to jointly generate on-screen, off-screen, and speech audio from video and text, outperforming prior models on a new UniHAGen-Bench.

LRS-VoxMM: A benchmark for in-the-wild audio-visual speech recognition
eess.AS 2026-04 unverdicted novelty 6.0

LRS-VoxMM is a new in-the-wild AVSR benchmark that is harder than LRS3 and demonstrates increasing value of visual information under acoustic degradation.

Tracking Listener Attention: Gaze-Guided Audio-Visual Speech Enhancement Framework
eess.AS 2026-04 unverdicted novelty 6.0

The GG-AVSE framework uses listener gaze direction combined with YOLO5Face and AVSEMamba to resolve target-speaker ambiguity in audio-visual speech enhancement, yielding gains in PESQ, STOI, and SI-SDR.

Delayed Commitment for Representation Readiness in Stage-wise Audio-Visual Learning
cs.SD 2026-05 unverdicted novelty 5.0

DPC-Net improves stage-wise audio-visual learning by correcting readiness deficiencies in fused representations using cross-layer and cross-modal evidence.

BUT System Description for CHiME-9 MCoRec Challenge
eess.AS 2026-04 unverdicted novelty 3.0

BUT's CHiME-9 MCoRec system conditions Parakeet-v2 ASR on AV-HuBERT visuals for 33.7% WER and uses Qwen3.5 LLM for hierarchical clustering to reach 0.97 F1, beating the baseline by 16.2% WER and 0.15 F1 on the develop...