pith. machine review for the scientific record.

arxiv: 2604.27866 · v1 · submitted 2026-04-30 · 📡 eess.AS · cs.MM · cs.SD

Recognition: unknown

LRS-VoxMM: A benchmark for in-the-wild audio-visual speech recognition

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 05:36 UTC · model grok-4.3

classification 📡 eess.AS · cs.MM · cs.SD
keywords audio-visual speech recognition · AVSR · benchmark · in-the-wild · VoxMM · LRS3 · acoustic degradation · visual information

The pith

The LRS-VoxMM benchmark shows that audio-visual speech recognition is considerably harder in the wild than on LRS3, and that visual cues become more important as the audio degrades.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LRS-VoxMM, a new benchmark for audio-visual speech recognition created by selecting suitable samples from the VoxMM dataset of real-world conversations and formatting them in the same style as the LRS3 benchmark. It also supplies versions of the data distorted by noise, reverberation, and bandwidth limits to simulate tough acoustic conditions. Experiments reveal that models perform worse on LRS-VoxMM than on LRS3 and that visual information helps more as audio quality falls. A sympathetic reader would care because most existing AVSR work uses clean, controlled recordings that may overestimate how well systems will work in everyday noisy or varied environments. The benchmark therefore offers a way to test and improve systems for more practical use.

Core claim

We introduce LRS-VoxMM as an in-the-wild benchmark for audio-visual speech recognition derived from VoxMM, a dataset of diverse real-world spoken conversations with human-annotated transcriptions. We select AVSR-suitable samples and preprocess them in an LRS-style format for direct use in existing AVSR pipelines. Compared with commonly used benchmarks, LRS-VoxMM covers a more diverse range of scenarios and acoustic conditions. We also release distorted evaluation sets with additive noise, reverberation, and bandwidth limitation to support evaluation under severe acoustic degradation. Experimental results show that LRS-VoxMM is considerably harder than LRS3 and that the contribution of visual information becomes more evident as the audio signal degrades.

What carries the argument

The LRS-VoxMM benchmark, formed by selecting AVSR-suitable samples from VoxMM and preprocessing them into LRS-style format, together with its accompanying distorted evaluation sets that add noise, reverberation, and bandwidth limitation. This construction lets existing AVSR pipelines run directly on more varied real-world data and isolates how visual input helps when audio is impaired.
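
To make the three degradation families concrete, here is a minimal sketch of how additive noise at a target SNR, reverberation via convolution with a room impulse response, and bandwidth limitation via low-pass filtering could be simulated. The paper's own generation code is not reproduced on this page; the noise source, impulse response, SNR value, and cutoff frequency below are illustrative assumptions, not the benchmark's actual settings.

# A minimal sketch of the three degradation families the benchmark describes
# (additive noise, reverberation, bandwidth limitation). The exact noise corpora,
# SNR grids, and room impulse responses used by the paper are NOT specified here;
# every constant below is an illustrative assumption.
import numpy as np
from scipy import signal


def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a noise clip into speech at a target signal-to-noise ratio (dB)."""
    noise = np.resize(noise, speech.shape)          # loop/trim noise to length
    p_speech = np.mean(speech ** 2) + 1e-12
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise


def add_reverb(speech: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve speech with a room impulse response, keeping the original length."""
    wet = signal.fftconvolve(speech, rir, mode="full")[: len(speech)]
    return wet / (np.max(np.abs(wet)) + 1e-12) * np.max(np.abs(speech))


def limit_bandwidth(speech: np.ndarray, sr: int, cutoff_hz: float = 3400.0) -> np.ndarray:
    """Low-pass the signal to emulate narrowband (telephone-like) audio."""
    b, a = signal.butter(6, cutoff_hz, btype="low", fs=sr)
    return signal.filtfilt(b, a, speech)


if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr * 2) / sr
    clean = 0.1 * np.sin(2 * np.pi * 220 * t)       # stand-in for a speech clip
    noisy = add_noise(clean, np.random.randn(len(clean)), snr_db=0.0)
    reverbed = add_reverb(clean, np.exp(-np.arange(int(0.3 * sr)) / (0.05 * sr)))
    narrowband = limit_bandwidth(clean, sr)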

If this is right

  • Models trained or tested only on LRS3 will show noticeably higher error rates when evaluated on LRS-VoxMM.
  • The accuracy improvement gained by using both audio and visual streams grows larger as audio quality is reduced by noise or reverberation.
  • The released distorted sets let researchers measure how well models handle specific types of acoustic corruption one at a time (a measurement sketch follows this list).
  • AVSR development should focus on fusion methods that make better use of visual cues precisely when audio is unreliable.
  • Realistic benchmarking of this kind supports progress toward systems that can be deployed in everyday noisy settings rather than controlled recordings.
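
One minimal way to check the expectations above, assuming reference transcripts and model outputs are available per condition, is to compute word error rate for an audio-only and an audio-visual system on each degraded condition and track the difference. The transcribe_audio and transcribe_av callables below are hypothetical model wrappers, not part of the paper's released code.

# A minimal sketch (not the paper's evaluation code) of checking whether the
# audio-visual gain grows with degradation: compare audio-only and AV word error
# rates per condition. `transcribe_audio` / `transcribe_av` are hypothetical wrappers.
from typing import Callable, Dict, List


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via Levenshtein distance over whitespace-separated words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


def visual_gain(
    samples: List[dict],
    transcribe_audio: Callable[[dict], str],
    transcribe_av: Callable[[dict], str],
) -> Dict[str, float]:
    """Audio-only WER minus AV WER, averaged per degradation condition."""
    gains: Dict[str, List[float]] = {}
    for s in samples:  # each sample: {"condition": ..., "text": ..., media fields}
        g = wer(s["text"], transcribe_audio(s)) - wer(s["text"], transcribe_av(s))
        gains.setdefault(s["condition"], []).append(g)
    return {c: sum(v) / len(v) for c, v in gains.items()}

If the paper's claim holds, the per-condition gain returned by visual_gain should increase as the acoustic condition becomes more severe; a flat or shrinking gain would point the other way.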

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Systems that reach low error rates on LRS-VoxMM are more likely to maintain performance in practical settings such as video calls or recordings made in public spaces.
  • The benchmark highlights the potential value of stronger visual-only speech recognition components that can serve as a fallback when audio is severely degraded.
  • Future training regimes could combine LRS-VoxMM with other datasets to build models that remain robust across a wider range of acoustic and visual conditions.
  • Continued reliance on clean benchmarks like LRS3 may produce overly optimistic estimates of how ready current AVSR technology is for real-world use.

Load-bearing premise

The samples chosen from VoxMM as AVSR-suitable, after LRS-style preprocessing, accurately represent diverse real-world scenarios and acoustic conditions without introducing selection bias.

What would settle it

If current AVSR models achieve word error rates on LRS-VoxMM that are comparable to those on LRS3, or if the accuracy gain from adding visual input stays the same or shrinks on the distorted sets, the claim that the new benchmark is considerably harder and that visual information becomes more important under degradation would be falsified.

read the original abstract

We introduce LRS-VoxMM, an in-the-wild benchmark for audio-visual speech recognition (AVSR). The benchmark is derived from VoxMM, a dataset of diverse real-world spoken conversations with human-annotated transcriptions. We select AVSR-suitable samples and preprocess them in an LRS-style format for direct use in existing AVSR pipelines. Compared with commonly used benchmarks, LRS-VoxMM covers a more diverse range of scenarios and acoustic conditions. We also release distorted evaluation sets with additive noise, reverberation, and bandwidth limitation to support evaluation under severe acoustic degradation. Experimental results show that LRS-VoxMM is considerably harder than LRS3 and that the contribution of visual information becomes more evident as the audio signal degrades. LRS-VoxMM supports more realistic AVSR benchmarking and encourages further research on the role of visual information in challenging real-world conditions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces LRS-VoxMM, an in-the-wild AVSR benchmark derived from the VoxMM dataset of real-world conversations. AVSR-suitable samples are selected and preprocessed in LRS-style format (face tracking, cropping, alignment) for compatibility with existing pipelines. Distorted evaluation sets are released with additive noise, reverberation, and bandwidth limitation. The central claim is that LRS-VoxMM is considerably harder than LRS3 and that the contribution of visual information increases as the audio signal degrades.

Significance. If the sample selection avoids bias toward high-quality visual cues and the experimental comparisons hold, LRS-VoxMM would provide a valuable, more diverse testbed than LRS3 for evaluating AVSR robustness in realistic acoustic conditions. The release of the benchmark and the three families of distorted sets is a concrete strength that supports reproducible research on when and how visual cues help under degradation.

major comments (2)
  1. [§3] §3 (Benchmark construction): The criteria for designating samples as 'AVSR-suitable' are not defined (no thresholds on face frontalness, lip visibility, occlusion, or acoustic SNR are stated, nor are rejection rates or statistics comparing retained vs. discarded clips). Because the pipeline first filters then applies LRS-style preprocessing, any preference for clear visual speech in the filter directly affects the measured visual benefit on the distorted sets; this selection step is load-bearing for both the 'harder than LRS3' and 'visual contribution increases with degradation' claims.
  2. [§4] §4 (Experiments): The manuscript asserts that LRS-VoxMM is 'considerably harder' than LRS3 and that visual utility grows with degradation, yet no quantitative results (WER tables for audio-only, visual-only, and AV models on clean and each distorted condition), baseline model specifications, or error analysis are provided. Without these numbers and controls it is impossible to verify the central experimental claims or to assess whether the observed trends survive the potential selection bias identified above.
minor comments (2)
  1. The abstract would be more informative if it included one or two concrete metrics (e.g., absolute WER on LRS-VoxMM vs. LRS3 under matched conditions) rather than only qualitative statements.
  2. [Tables] Tables comparing LRS-VoxMM with LRS3 should explicitly list the exact acoustic conditions, model architectures, and training data used for each entry to avoid ambiguity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript introducing LRS-VoxMM. The comments highlight important areas for improving clarity, reproducibility, and the strength of our experimental claims. We address each major comment below and will make the corresponding revisions to the manuscript.

read point-by-point responses
  1. Referee: [§3] §3 (Benchmark construction): The criteria for designating samples as 'AVSR-suitable' are not defined (no thresholds on face frontalness, lip visibility, occlusion, or acoustic SNR are stated, nor are rejection rates or statistics comparing retained vs. discarded clips). Because the pipeline first filters then applies LRS-style preprocessing, any preference for clear visual speech in the filter directly affects the measured visual benefit on the distorted sets; this selection step is load-bearing for both the 'harder than LRS3' and 'visual contribution increases with degradation' claims.

    Authors: We agree that explicit criteria and statistics are essential for reproducibility and for evaluating potential selection bias. The original manuscript described the selection at a high level to maintain focus on the benchmark release and distorted sets. In the revised version, we will expand §3 with the precise thresholds applied (e.g., maximum yaw/pitch angles for frontalness, minimum lip visibility and occlusion scores from the face tracker, and minimum acoustic SNR), the exact rejection rate, and comparative statistics (duration, speaker diversity, acoustic conditions) between retained and discarded clips. These additions will allow readers to assess whether the retained samples introduce bias toward high-quality visual cues and will directly support the validity of the hardness and visual-utility claims (a hypothetical sketch of such a selection filter follows these responses). revision: yes

  2. Referee: [§4] §4 (Experiments): The manuscript asserts that LRS-VoxMM is 'considerably harder' than LRS3 and that visual utility grows with degradation, yet no quantitative results (WER tables for audio-only, visual-only, and AV models on clean and each distorted condition), baseline model specifications, or error analysis are provided. Without these numbers and controls it is impossible to verify the central experimental claims or to assess whether the observed trends survive the potential selection bias identified above.

    Authors: We acknowledge that the current experimental section relies on summary statements rather than exhaustive tables and controls. While the abstract and main text reference the key trends, we will add a dedicated results subsection (or expanded table) in the revision that reports word error rates for audio-only, visual-only, and audio-visual models on the clean LRS-VoxMM set and on every distorted condition (additive noise, reverberation, bandwidth limitation). We will also document the exact baseline architectures, training procedures, and hyperparameters used, and include a short error analysis examining error patterns as a function of acoustic degradation. These quantitative details will enable independent verification of the “considerably harder” claim relative to LRS3 and of the increasing visual contribution, while also allowing assessment of robustness to the selection criteria. revision: yes
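
As a companion to the first response, here is a hypothetical sketch of what an explicit, reproducible AVSR-suitability filter could look like. Every field name and threshold is an assumption for illustration; the paper's actual criteria are exactly what the referee asks to see documented.

# A hypothetical sketch of the kind of explicit selection filter the rebuttal
# promises to document. None of these thresholds or field names come from the
# paper; they only illustrate how stated criteria could be made reproducible.
from dataclasses import dataclass


@dataclass
class ClipMetadata:
    max_abs_yaw_deg: float      # largest head yaw across the clip
    max_abs_pitch_deg: float    # largest head pitch across the clip
    min_lip_visibility: float   # face-tracker lip visibility score in [0, 1]
    max_occlusion: float        # face-tracker occlusion score in [0, 1]
    estimated_snr_db: float     # rough acoustic SNR estimate for the clip


def is_avsr_suitable(
    clip: ClipMetadata,
    max_yaw_deg: float = 45.0,        # assumed threshold, not from the paper
    max_pitch_deg: float = 30.0,      # assumed threshold
    min_lip_visibility: float = 0.8,  # assumed threshold
    max_occlusion: float = 0.2,       # assumed threshold
    min_snr_db: float = 0.0,          # assumed threshold
) -> bool:
    """Return True if the clip passes every (hypothetical) AVSR-suitability check."""
    return (
        clip.max_abs_yaw_deg <= max_yaw_deg
        and clip.max_abs_pitch_deg <= max_pitch_deg
        and clip.min_lip_visibility >= min_lip_visibility
        and clip.max_occlusion <= max_occlusion
        and clip.estimated_snr_db >= min_snr_db
    )

Publishing thresholds of this form, together with the rejection rate and retained-versus-discarded statistics, would let readers probe the selection-bias concern directly.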

Circularity Check

0 steps flagged

No circularity: benchmark derived from external data with independent evaluation

full rationale

The paper constructs LRS-VoxMM by selecting AVSR-suitable samples from the external VoxMM dataset and applying standard LRS-style preprocessing (face tracking, cropping, alignment). Claims of greater difficulty than LRS3 and increasing visual utility under degradation rest on empirical runs of existing AVSR models on the new benchmark and its distorted variants. No equations, fitted parameters, or derivations appear; the selection filter and preprocessing are described as procedural steps without reducing any result to a self-definition or renamed input. Prior LRS citations serve only for format compatibility and baseline comparison, not as load-bearing justification for the new benchmark's properties. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper's central contribution is dataset curation and release rather than a theoretical derivation, resting primarily on domain assumptions about data suitability and compatibility with existing pipelines.

axioms (2)
  • domain assumption VoxMM contains human-annotated transcriptions from which AVSR-suitable samples can be reliably selected.
    The paper relies on VoxMM as the source dataset with annotations for deriving the benchmark.
  • domain assumption LRS-style formatting enables direct compatibility with existing AVSR models and pipelines.
    The abstract states the preprocessing is done 'for direct use in existing AVSR pipelines'.

pith-pipeline@v0.9.0 · 5466 in / 1486 out tokens · 61316 ms · 2026-05-07T05:36:44.583867+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

22 extracted references · 2 canonical work pages · 1 internal anchor

  1. [1]

    Introduction Audio-visual speech recognition (AVSR) has advanced substantially in recent years with the development of deep neural models and the availability of large-scale datasets [1–3]. Public benchmarks such as LRW [4], LRS2 [5], and especially LRS3 [6] have played an important role in this progress by enabling standardized evaluation and reprod...

  2. [2]

    LRS-VoxMM benchmark VoxMM [9] is a multimodal conversational corpus that provides audio, video, transcripts, speaker labels, face tracks, and other metadata. Unlike conventional audio-visual corpora organized around pre-segmented utterances, it was designed to annotate entire videos as completely as possible, including overlapping speech, off-screen spee...

  3. [3]

    Baselines and checkpoints We report results for audio-only ASR, AVSR, and visual speech recognition (VSR) using representative released baselines

    Experiments 3.1. Baselines and checkpoints We report results for audio-only ASR, AVSR, and visual speech recognition (VSR) using representative released baselines. Unless otherwise noted, all results are obtained from publicly available official checkpoints under their original training configurations. Since these baselines were trained with differ...

  4. [4]

    Conclusion We present LRS-VoxMM, a benchmark-ready in-the-wild AVSR dataset derived from VoxMM and released in a format compatible with existing LRS-based pipelines. By curating AVSR-suitable samples and standardizing the preprocessing, transcript format, and file structure, we make a diverse real-world resource more accessible as a common benchmark....

  5. [5]

    Learning audio-visual speech representation by masked multimodal cluster prediction,

    B. Shi, W.-N. Hsu, K. Lakhotia, and A. Mohamed, “Learning audio-visual speech representation by masked multimodal cluster prediction,” in Proc. ICLR, 2022

  6. [6]

    Auto-AVSR: Audio-visual speech recognition with automatic labels,

    P. Ma, A. Haliassos, A. Fernandez-Lopez, H. Chen, S. Petridis, and M. Pantic, “Auto-AVSR: Audio-visual speech recognition with automatic labels,” in Proc. ICASSP. IEEE, 2023, pp. 1–5

  7. [7]

    Large language models are strong audio-visual speech recognition learners,

    U. Cappellazzo, M. Kim, H. Chen, P. Ma, S. Petridis, D. Falavigna, A. Brutti, and M. Pantic, “Large language models are strong audio-visual speech recognition learners,” in Proc. ICASSP. IEEE, 2025, pp. 1–5

  8. [8]

    Lip reading in the wild,

    J. S. Chung and A. Zisserman, “Lip reading in the wild,” in Proc. ACCV, 2016

  9. [9]

    Deep audio-visual speech recognition,

    T. Afouras, J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, “Deep audio-visual speech recognition,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018

  10. [10]

    LRS3-TED: a large-scale dataset for visual speech recognition,

    T. Afouras, J. S. Chung, and A. Zisserman, “LRS3-TED: a large-scale dataset for visual speech recognition,” arXiv preprint arXiv:1809.00496, 2018

  11. [11]

    Do VSR models generalize beyond LRS3?

    Y. A. D. Djilali, S. Narayan, E. LeBihan, H. Boussaid, E. Almazrouei, and M. Debbah, “Do VSR models generalize beyond LRS3?” in Proc. WACV, 2024

  12. [12]

    Muavic: A multilingual audio-visual corpus for robust speech recognition and robust speech-to-text translation,

    M. Anwar, B. Shi, V. Goswami, W.-N. Hsu, J. Pino, and C. Wang, “Muavic: A multilingual audio-visual corpus for robust speech recognition and robust speech-to-text translation,” in Proc. Interspeech, 2023, pp. 4064–4068

  13. [13]

    VoxMM: Rich transcription of conversations in the wild,

    D. Kwak, J. Jung, K. Nam, Y. Jang, J.-W. Jung, S. Watanabe, and J. S. Chung, “VoxMM: Rich transcription of conversations in the wild,” in Proc. ICASSP, 2024

  14. [14]

    wav2vec 2.0: A framework for self-supervised learning of speech representations,

    A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in NeurIPS, 2020

  15. [15]

    EDNet: A versatile speech enhancement framework with gating mamba mechanism and phase shift-invariant training,

    D. Kwak, Y. Jang, S. Kim, and J. S. Chung, “EDNet: A versatile speech enhancement framework with gating mamba mechanism and phase shift-invariant training,” IEEE Transactions on Audio, Speech and Language Processing, 2026

  16. [16]

    The diverse environments multi-channel acoustic noise database (DEMAND): A database of multichannel environmental noise recordings,

    J. Thiemann, N. Ito, and E. Vincent, “The diverse environments multi-channel acoustic noise database (DEMAND): A database of multichannel environmental noise recordings,” in Proceedings of Meetings on Acoustics, vol. 19, no. 1. Acoustical Society of America, 2013, p. 035081

  17. [17]

    Watch or listen: Robust audio-visual speech recognition with visual corruption modeling and reliability scoring,

    J. Hong, M. Kim, J. Choi, and Y. M. Ro, “Watch or listen: Robust audio-visual speech recognition with visual corruption modeling and reliability scoring,” in Proc. CVPR, 2023

  18. [18]

    Whisper-flamingo: Integrating visual features into whisper for audio-visual speech recognition and translation,

    A. Rouditchenko, Y. Gong, S. Thomas, L. Karlinsky, H. Kuehne, R. Feris, and J. Glass, “Whisper-flamingo: Integrating visual features into whisper for audio-visual speech recognition and translation,” in Proc. Interspeech, 2024

  19. [19]

    Multi-task corrupted prediction for learning robust audio-visual speech representation,

    S. Kim, S. Cho, S. Bae, K. Jang, and S.-Y. Yun, “Multi-task corrupted prediction for learning robust audio-visual speech representation,” in Proc. ICLR, 2025

  20. [20]

    VoxCeleb2: Deep speaker recognition,

    J. S. Chung, A. Nagrani, and A. Zisserman, “VoxCeleb2: Deep speaker recognition,” in Proc. Interspeech, 2018

  21. [21]

    Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation,

    A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, “Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation,” in Proc. ACM SIGGRAPH, vol. 37, no. 4, 2018, pp. 1–13

  22. [22]

    Robust speech recognition via large-scale weak supervision,

    A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in Proc. ICML, 2023