LRS3-TED: a large-scale dataset for visual speech recognition
This paper introduces a new multi-modal dataset for visual and audio-visual speech recognition. It includes face tracks from over 400 hours of TED and TEDx videos, along with the corresponding subtitles and word alignment boundaries. The new dataset is substantially larger in scale compared to other public datasets that are available for general research.
Forward citations
Cited by 7 Pith papers
- Hierarchical Codec Diffusion for Video-to-Speech Generation
  HiCoDiT generates speech from video by conditioning low-level RVQ tokens on speaker identity and high-level tokens on facial expressions via a dual-scale normalized diffusion transformer.
- CoSyncDiT: Cognitive Synchronous Diffusion Transformer for Movie Dubbing
  CoSyncDiT is a cognitive-inspired diffusion transformer that achieves state-of-the-art lip synchronization and naturalness in movie dubbing by guiding noise-to-speech generation through acoustic, visual, and contextua...
- OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text
  OmniSonic introduces a TriAttn-DiT architecture with MoE gating to jointly generate on-screen, off-screen, and speech audio from video and text, outperforming prior models on a new UniHAGen-Bench.
- LRS-VoxMM: A benchmark for in-the-wild audio-visual speech recognition
  LRS-VoxMM is a new in-the-wild AVSR benchmark that is harder than LRS3 and demonstrates increasing value of visual information under acoustic degradation.
- Tracking Listener Attention: Gaze-Guided Audio-Visual Speech Enhancement Framework
  The GG-AVSE framework uses listener gaze direction combined with YOLO5Face and AVSEMamba to resolve target-speaker ambiguity in audio-visual speech enhancement, yielding gains in PESQ, STOI, and SI-SDR.
- Delayed Commitment for Representation Readiness in Stage-wise Audio-Visual Learning
  DPC-Net improves stage-wise audio-visual learning by correcting readiness deficiencies in fused representations using cross-layer and cross-modal evidence.
- BUT System Description for CHiME-9 MCoRec Challenge
  BUT's CHiME-9 MCoRec system conditions Parakeet-v2 ASR on AV-HuBERT visuals for 33.7% WER and uses Qwen3.5 LLM for hierarchical clustering to reach 0.97 F1, beating the baseline by 16.2% WER and 0.15 F1 on the develop...