M2S-AVSR: Modality-aware Multi-view Self-supervised Representation for Robust Audio-Visual Speech Recognition

Cancan Li; Fei Su; Juan Liu; Ming Li

arxiv: 2606.05763 · v1 · pith:27BENKZVnew · submitted 2026-06-04 · 📡 eess.AS · cs.SD


Pith reviewed 2026-06-27 23:56 UTC · model grok-4.3


keywords: audio-visual speech recognition · self-supervised learning · multi-view representation · modality-aware fusion · view-invariant features · robust AVSR · real-scene dataset · LRS3 benchmark


The pith

A modality-aware multi-view self-supervised framework learns view-invariant visual speech features and quality-aware fusion to boost robustness in audio-visual speech recognition under real-world distortions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes M2S-AVSR to handle viewpoint variation, audio distortion, and visual occlusion that degrade audio-visual speech recognition. It trains a multi-view encoder to extract view-invariant visual representations and adds a modality-aware module that scores quality and synchrony to guide fine-grained fusion during decoding. The approach is tested on English and Mandarin benchmarks plus a newly collected multi-scenario dataset, showing large gains precisely when inputs are perturbed or degraded. The central goal is to make visual cues reliable enough to improve recognition outside studio conditions.

Core claim

The M2S-AVSR framework first trains a multi-view representation learning encoder to produce view-invariant visual speech features, then applies a modality-aware module that explicitly estimates modality quality and cross-modal synchrony to control fine-grained visual information injection during decoding. On LRS3 this yields up to 29.4 percent relative improvement under viewpoint perturbation and visual degradation; the method sets a new state-of-the-art on the MISP2021-AVSR test set and records the best result among compared systems in outdoor scenes on the introduced AISHELL8-RealScene dataset.

What carries the argument

The multi-view representation learning encoder paired with the modality-aware fusion module, which together generate view-invariant features and selectively inject visual information according to estimated quality and synchrony.
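The abstract gives no equations for the modality-aware module, so the following is only a minimal sketch of the general idea of quality- and synchrony-gated visual injection, not the authors' architecture. The function name `gated_fusion`, the per-frame logits, and the additive residual form are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    """Logistic function, maps logits to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(audio, visual, quality_logit, sync_logit):
    """Add visual features to audio features, scaled by a per-frame gate.

    audio, visual: arrays of shape (T, D), assumed already time-aligned.
    quality_logit, sync_logit: arrays of shape (T,), hypothetical scores
    that a real system would predict from the inputs.
    """
    # The gate is high only when BOTH quality and synchrony look good.
    gate = sigmoid(quality_logit) * sigmoid(sync_logit)
    fused = audio + gate[:, None] * visual
    return fused, gate

# Toy inputs: frame 1 is badly out of sync, frame 2 is heavily occluded.
rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 3))
visual = rng.normal(size=(4, 3))
quality = np.array([4.0, 4.0, -20.0, 4.0])
sync = np.array([4.0, -20.0, 4.0, 4.0])

fused, gate = gated_fusion(audio, visual, quality, sync)
print(np.round(gate, 3))
```

In this toy setup the gate stays close to one for the clean frames and collapses toward zero for the degraded ones, so those frames fall back to the audio features alone.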

If this is right

Up to 29.4 percent relative word-error-rate reduction occurs on LRS3 when viewpoint and visual quality are perturbed.
New state-of-the-art performance is reached on the MISP2021-AVSR test set.
Best reported results among compared systems appear in outdoor scenes of the AISHELL8-RealScene benchmark.
The framework supplies explicit support for conversational speech recognition in realistic multi-view environments.
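Because the headline figure is a relative reduction, it helps to recall how word error rate and relative reduction are computed. The sketch below uses the standard word-level edit distance; the sentences and the 10.0% baseline are invented for illustration and are not results from the paper.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)

def relative_reduction(baseline_wer, system_wer):
    """Fraction of the baseline error removed by the new system."""
    return (baseline_wer - system_wer) / baseline_wer

# Hypothetical sentences: one substitution and one deletion over six words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER = {wer:.3f}")

# Hypothetical baseline: what a 29.4% relative reduction would leave.
baseline = 10.0
print(f"system WER = {baseline * (1 - 0.294):.2f}%")
```

The same relative figure corresponds to very different absolute gains depending on the baseline, which is why absolute WER values matter when judging the claim.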

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the view-invariant encoder generalizes, the same pre-training strategy could be applied to lip-reading systems that must handle arbitrary camera placements.
Explicit synchrony modeling may reduce errors in remote-conferencing settings where audio and video arrive with variable delay.
Publication of the multi-scenario dataset may accelerate development of outdoor and multi-view audio-visual benchmarks.
The quality-aware fusion idea could transfer to other multimodal tasks such as audio-visual event localization under occlusion.
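To make the variable-delay point concrete, here is a minimal sketch of a classical way to estimate audio-visual offset: scan candidate lags and keep the one with the highest normalized correlation. This is a generic signal-processing baseline, not the synchrony model described in the paper; the function name `estimate_lag` and the two one-dimensional input signals (an audio energy envelope and a lip-motion measure at the same frame rate) are assumptions.

```python
import numpy as np

def estimate_lag(audio_env, lip_motion, max_lag):
    """Return (lag, score) maximizing normalized correlation.

    A positive lag means lip_motion is delayed relative to audio_env.
    Both inputs are 1-D arrays of equal length at the same frame rate.
    """
    a = (audio_env - audio_env.mean()) / (audio_env.std() + 1e-8)
    v = (lip_motion - lip_motion.mean()) / (lip_motion.std() + 1e-8)
    best_lag, best_score = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x, y = a[: len(a) - lag], v[lag:]
        else:
            x, y = a[-lag:], v[: len(v) + lag]
        score = float(np.mean(x * y))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag, best_score

# Synthetic check: the "video" signal is the audio signal delayed by 3 frames.
rng = np.random.default_rng(0)
audio_env = rng.normal(size=200)
lip_motion = np.roll(audio_env, 3)

lag, score = estimate_lag(audio_env, lip_motion, max_lag=10)
print(lag, round(score, 3))
```

A learned synchrony estimate would replace the hand-built correlation, but the output plays the same role: a signal that tells the fusion stage how far to trust frame-level alignment.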

Load-bearing premise

View-invariant visual features produced by the encoder remain useful when fused with audio even under real-world asynchrony and occlusion.

What would settle it

A new test set containing extreme camera angles and heavy visual occlusion on which the reported word-error-rate reductions disappear or reverse would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.05763 by Cancan Li, Fei Su, Juan Liu, Ming Li.

**Figure 2.** Overview of the proposed M2S-AVSR framework.

**Figure 3.** Detailed structure of the modality-aware module. The upper branch …

**Figure 4.** Overview of the AISHELL8-RealScene recording scenes. (a) Schematic illustration of the recording scenes, where …

**Figure 5.** Distribution analysis of AISHELL8-RealScene dataset.

**Figure 6.** Occlusion-based lip sensitivity visualization for self-supervised multi …

**Figure 7.** Visualization of modality-aware fusion on the MISP2021 dataset. Test sample: R16 …

**Figure 8.** Evaluation on multi-view data (VSR WER (%)). “30h” and “433h” …

Original abstract

Audio-Visual Speech Recognition (AVSR) enhances speech recognition robustness by leveraging visual cues, while real-world scenarios remain challenging due to viewpoint variation, audio distortion, and visual occlusion, which degrade modality quality and increase audio-visual asynchrony. In this paper, we propose a novel Modality-aware Multi-view Self-supervised representation framework for robust Audio-Visual Speech Recognition (M2S-AVSR). First, we introduce a multi-view representation learning encoder to learn view-invariant visual speech representations. Next, we employ a modality-aware module that explicitly models modality quality and cross-modal synchrony to perform fine-grained modality-aware fusion, enabling fine-grained visual information injection during decoding. In addition, we present AISHELL8-RealScene, a public multi-scenario, multi-view conversational audio-visual dataset recorded in real-world environments, and establish a speech recognition benchmark on it. Experiments on English and Mandarin benchmarks demonstrate the effectiveness of the proposed method under challenging conditions. On LRS3, M2S-AVSR achieves up to 29.4% relative improvement under viewpoint perturbation and visual degradation settings. Our method also achieves new state-of-the-art performance on the MISP2021-AVSR test set. On AISHELL8-RealScene, it achieves the best result in outdoor scenes. The proposed method and dataset provide useful support for future research on robust speech and multimodal tasks under realistic conditions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

M2S-AVSR pairs a multi-view self-supervised visual encoder with explicit modality-quality and synchrony modeling, plus a new real-scene multi-view dataset, and shows solid gains on perturbed LRS3 and SOTA on MISP2021.


The main takeaway is that this paper gives a concrete way to handle viewpoint shifts and modality dropouts in AVSR by learning view-invariant visual features first, then fusing audio and video with an explicit quality and timing check. They also release AISHELL8-RealScene, a conversational dataset recorded in actual outdoor and indoor settings.

The dataset is the clearest addition. Existing AVSR benchmarks are mostly studio or single-view, so a public multi-view real-scene collection fills a practical gap and lets people test outdoor performance directly. The reported 29.4% relative WER drop on LRS3 under viewpoint and visual degradation, plus the new best number on MISP2021, line up with the stated goals.

The architecture itself is straightforward and avoids obvious circularity. The multi-view encoder targets invariance, the modality module injects fine-grained visual cues only when they are reliable, and the numbers are consistent with that mechanism. No load-bearing assumption collapses on inspection.

The soft spot is the usual one for this style of work: without the full ablations it is hard to separate how much comes from the self-supervised pretraining versus the fusion module or just longer training. That said, the stress-test found no internal contradiction, so the gap is incremental rather than fatal.

This is worth a serious referee for the AVSR robustness crowd. The dataset alone gives it enough substance to justify review time even if the gains need tighter controls in revision.

Referee Report

0 major / 2 minor

Summary. The manuscript proposes M2S-AVSR, a modality-aware multi-view self-supervised framework for robust audio-visual speech recognition. It introduces a multi-view representation learning encoder to produce view-invariant visual speech features and a modality-aware fusion module that explicitly models modality quality and cross-modal synchrony for fine-grained fusion during decoding. The authors also release the AISHELL8-RealScene dataset (multi-scenario, multi-view conversational recordings in real-world environments) and report up to 29.4% relative improvement on LRS3 under viewpoint perturbation and visual degradation, new SOTA on the MISP2021-AVSR test set, and best results in outdoor scenes on the new dataset.

Significance. If the experimental results hold under rigorous controls, the work would advance robust AVSR by directly targeting real-world degradations (viewpoint variation, occlusion, asynchrony) via self-supervised view-invariant features and quality-aware fusion. The public release of AISHELL8-RealScene provides a concrete, falsifiable benchmark for future multimodal speech research in realistic conditions.

minor comments (2)

The abstract reports relative improvements and SOTA claims without absolute WER numbers or explicit baseline comparisons; adding these (e.g., in a results table) would strengthen verifiability of the 29.4% figure.
The description of the multi-view encoder and modality-aware module is high-level; the main text should include precise architectural diagrams, loss formulations, and training hyperparameters to support reproducibility.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, recognition of the significance for robust AVSR under real-world degradations, and recommendation for minor revision. The manuscript presents M2S-AVSR with multi-view self-supervised visual encoding and modality-aware fusion, releases the AISHELL8-RealScene dataset, and reports gains on LRS3, MISP2021, and the new dataset. No major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity


The paper proposes an architectural framework (multi-view encoder for view-invariant features + modality-aware fusion) and reports empirical gains on LRS3 perturbations and MISP2021. No derivation chain, equation, or result is shown to reduce by construction to fitted parameters, self-citations, or renamed inputs. Performance numbers are presented as experimental outcomes on external benchmarks and a newly introduced dataset, with no load-bearing self-referential loops or uniqueness theorems invoked from prior author work. The claims therefore rest on external validation rather than on the paper's own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations, training objectives, or implementation details; therefore no free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.1-grok · 5794 in / 1159 out tokens · 18238 ms · 2026-06-27T23:56:51.897262+00:00 · methodology


Reference graph

Works this paper leans on

93 extracted references · 9 canonical work pages · 4 internal anchors

[1]

Speech recognition with deep recurrent neural networks,

A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” inProc. ICASSP, 2013, pp. 6645–6649

2013
[2]

Joint ctc-attention based end-to-end speech recognition using multi-task learning,

S. Kim, T. Hori, and S. Watanabe, “Joint ctc-attention based end-to-end speech recognition using multi-task learning,” inProc. ICASSP, 2017, pp. 4835–4839

2017
[3]

E2e-sincnet: Toward fully end-to-end speech recognition,

T. Parcollet, M. Morchid, and G. Linares, “E2e-sincnet: Toward fully end-to-end speech recognition,” inProc. ICASSP, 2020, pp. 7714–7718. 13

2020
[4]

Complex spectral mapping for single-and multi-channel speech enhancement and robust asr,

Z.-Q. Wang, P. Wang, and D. Wang, “Complex spectral mapping for single-and multi-channel speech enhancement and robust asr,” IEEE/ACM transactions on audio, speech, and language processing, vol. 28, pp. 1778–1787, 2020

2020
[5]

Towards efficient models for real-time deep noise suppression,

S. Braun, H. Gamper, C. K. Reddy, and I. Tashev, “Towards efficient models for real-time deep noise suppression,” inProc. ICASSP. IEEE, 2021, pp. 656–660

2021
[6]

Learning complex spectral mapping with gated convolutional recurrent networks for monaural speech enhancement,

K. Tan and D. Wang, “Learning complex spectral mapping with gated convolutional recurrent networks for monaural speech enhancement,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 380–390, 2019

2019
[7]

End-to-end audiovisual speech recognition,

S. Petridis, T. Stafylakis, P. Ma, F. Cai, G. Tzimiropoulos, and M. Pantic, “End-to-end audiovisual speech recognition,” inProc. ICASSP. IEEE, 2018, pp. 6548–6552

2018
[8]

Visual context-driven audio feature enhancement for robust end-to-end audio-visual speech recognition,

J. Hong, M. Kim, D. Yoo, and Y . M. Ro, “Visual context-driven audio feature enhancement for robust end-to-end audio-visual speech recognition,” inProc. Interspeech, 2022, pp. 2838–2842

2022
[9]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”Advances in neural information processing systems, vol. 30, 2017

2017
[10]

Conformer: Convolution-augmented transformer for speech recognition,

A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y . Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y . Wuet al., “Conformer: Convolution-augmented transformer for speech recognition,”arXiv:2005.08100, 2020

work page arXiv 2005
[11]

Connection- ist temporal classification: labelling unsegmented sequence data with recurrent neural networks,

A. Graves, S. Fern ´andez, F. Gomez, and J. Schmidhuber, “Connection- ist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” inProc. ICML, 2006, pp. 369–376

2006
[12]

Deep audio-visual speech recognition,

T. Afouras, J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, “Deep audio-visual speech recognition,”IEEE transactions on pattern analysis and machine intelligence, vol. 44, no. 12, pp. 8717–8727, 2018

2018
[13]

Learning contextually fused audio-visual representations for audio- visual speech recognition,

Z.-Q. Zhang, J. Zhang, J.-S. Zhang, M.-H. Wu, X. Fang, and L.-R. Dai, “Learning contextually fused audio-visual representations for audio- visual speech recognition,” inProc. ICIP. IEEE, 2022, pp. 1346–1350

2022
[14]

wav2vec 2.0: A framework for self-supervised learning of speech representations,

A. Baevski, Y . Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Proc. NeurIPS, 2020, pp. 12 449–12 460

2020
[15]

Wavlm: Large-scale self-supervised pre- training for full stack speech processing,

S. Chen, C. Wang, Z. Chen, Y . Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiaoet al., “Wavlm: Large-scale self-supervised pre- training for full stack speech processing,”IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

2022
[16]

Robust speech recognition via large-scale weak super- vision,

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak super- vision,” inProc. ICML, 2023, pp. 28 492–28 518

2023
[17]

Robust Self-Supervised Audio- Visual Speech Recognition,

B. Shi, W.-N. Hsu, and A. Mohamed, “Robust Self-Supervised Audio- Visual Speech Recognition,” inProc. Interspeech, 2022, pp. 2118–2122

2022
[18]

Leveraging unimodal self-supervised learning for multimodal audio-visual speech recognition,

X. Pan, P. Chen, Y . Gong, H. Zhou, X. Wang, and Z. Lin, “Leveraging unimodal self-supervised learning for multimodal audio-visual speech recognition,” inProc. ACL, 2022, pp. 4491–4503

2022
[19]

Self-supervised adaptive av fusion module for pre-trained asr models,

C. Simic and T. Bocklet, “Self-supervised adaptive av fusion module for pre-trained asr models,” inProc. ICASSP. IEEE, 2024, pp. 12 787– 12 791

2024
[20]

Whisper-flamingo: Integrating visual features into whisper for audio-visual speech recognition and translation,

A. Rouditchenko, Y . Gong, S. Thomas, L. Karlinsky, H. Kuehne, R. Feris, and J. Glass, “Whisper-flamingo: Integrating visual features into whisper for audio-visual speech recognition and translation,” in Proc. Interspeech, 2024, pp. 2420–2424

2024
[21]

Where visual speech meets language: Vsp-llm framework for efficient and context-aware visual speech processing,

J. Yeo, S. Hanet al., “Where visual speech meets language: Vsp-llm framework for efficient and context-aware visual speech processing,” in Proc. EMNLP, 2024, pp. 11 391–11 406

2024
[22]

Large language models are strong audio-visual speech recognition learners,

U. Cappellazzo, M. Kim, H. Chen, P. Maet al., “Large language models are strong audio-visual speech recognition learners,” inProc. ICASSP, 2025, pp. 1–5

2025
[23]

Mms-llama: Efficient llm- based audio-visual speech recognition with minimal multimodal speech tokens,

J. H. Yeo, H. Rha, S. J. Park, and Y . M. Ro, “Mms-llama: Efficient llm- based audio-visual speech recognition with minimal multimodal speech tokens,” inProc. ACL, 2025, pp. 20 724–20 735

2025
[24]

LRS3-TED: a large-scale dataset for visual speech recognition

T. Afouras, J. S. Chung, and A. Zisserman, “Lrs3-ted: a large-scale dataset for visual speech recognition,”arXiv preprint arXiv:1809.00496, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[25]

Purification before fusion: Toward mask-free speech enhance- ment for robust audio-visual speech recognition,

L. Wu, X. Zhang, H. Yuan, Y . Zhang, C. Zheng, L. Xie, T. Liu, and E. Yin, “Purification before fusion: Toward mask-free speech enhance- ment for robust audio-visual speech recognition,” inProc. ICASSP, 2026, pp. 17 932–17 936

2026
[26]

Attention bottlenecks for multimodal fusion,

A. Nagrani, S. Yanget al., “Attention bottlenecks for multimodal fusion,”Advances in Neural Information Processing Systems, vol. 34, 2021

2021
[27]

Cross-modal attention network for temporal inconsistent audio-visual event localization,

H. Xuan, Z. Zhanget al., “Cross-modal attention network for temporal inconsistent audio-visual event localization,” inProc. AAAI, vol. 34, no. 01, 2020, pp. 279–286

2020
[28]

Modality attention for end-to-end audio-visual speech recognition,

P. Zhou, W. Yang, W. Chen, Y . Wang, and J. Jia, “Modality attention for end-to-end audio-visual speech recognition,” inProc. ICASSP. IEEE, 2019, pp. 6565–6569

2019
[29]

Robust audio-visual speech recognition using bimodal dfsmn with multi-condition training and dropout regulariza- tion,

S. Zhang, M. Leiet al., “Robust audio-visual speech recognition using bimodal dfsmn with multi-condition training and dropout regulariza- tion,” inProc. ICASSP, 2019, pp. 6570–6574

2019
[30]

Training strategies to handle missing modalities for audio-visual expression recognition,

S. Parthasarathy and S. Sundaram, “Training strategies to handle missing modalities for audio-visual expression recognition,” inProc. ICMI, 2020, pp. 400–404

2020
[31]

Enhanced self-supervised multi-view rep- resentations with modality-missing robustness for audio-visual speech recognition,

F. Su, C. Li, and J. Liu, “Enhanced self-supervised multi-view rep- resentations with modality-missing robustness for audio-visual speech recognition,” inProc. ICME, 2025, pp. 1–6

2025
[32]

Audio-visual deep learning for noise robust speech recognition,

J. Huang and B. Kingsbury, “Audio-visual deep learning for noise robust speech recognition,” inProc. ICASSP. IEEE, 2013, pp. 7596–7599

2013
[33]

Deep multimodal learning for audio-visual speech recognition,

Y . Mroueh, E. Marcheret, and V . Goel, “Deep multimodal learning for audio-visual speech recognition,” inProc. ICASSP. IEEE, 2015, pp. 2130–2134

2015
[34]

End-to-end audio-visual speech recognition with conformers,

P. Ma, S. Petridis, and M. Pantic, “End-to-end audio-visual speech recognition with conformers,” inProc. ICASSP, 2021, pp. 7613–7617

2021
[35]

Lip reading sentences in the wild,

J. Son Chung, A. Senior, O. Vinyals, and A. Zisserman, “Lip reading sentences in the wild,” inProc. CVPR, 2017, pp. 6447–6456

2017
[36]

Zero- avsr: Zero-shot audio-visual speech recognition with llms by learning language-agnostic speech representations,

J. H. Yeo, M. Kim, C. W. Kim, S. Petridis, and Y . M. Ro, “Zero- avsr: Zero-shot audio-visual speech recognition with llms by learning language-agnostic speech representations,” inProc. ICCV, 2025, pp. 6693–6703

2025
[37]

Mitigating at- tention sinks and massive activations in audio-visual speech recognition with llms,

A. Anand, U. Cappellazzo, S. Petridis, and M. Pantic, “Mitigating at- tention sinks and massive activations in audio-visual speech recognition with llms,” inProc. ICASSP. IEEE, 2026, pp. 17 942–17 946

2026
[38]

Multi- angle lipreading using angle classification and angle-specific feature integration,

S. Isobe, S. Tamura, S. Hayamizu, Y . Gotoh, and M. Nose, “Multi- angle lipreading using angle classification and angle-specific feature integration,” inProc. ICCSPA. IEEE, 2021, pp. 1–5

2021
[39]

Efficient multi-angle audio-visual speech recognition using parallel wavegan based scene classifier

S. Isobe, S. Tamura, Y . Gotoh, and M. Nose, “Efficient multi-angle audio-visual speech recognition using parallel wavegan based scene classifier.” inProc. ICPRAM, 2022, pp. 449–460

2022
[40]

Audio-visual speech recognition in-the-wild: Multi-angle vehicle cabin corpus and attention-based method,

A. Axyonov, D. Ryumin, D. Ivanko, A. Kashevnik, and A. Karpov, “Audio-visual speech recognition in-the-wild: Multi-angle vehicle cabin corpus and attention-based method,” inProc. ICASSP. IEEE, 2024, pp. 8195–8199

2024
[41]

Multi-view spectral clustering with adaptive graph learning and tensor schatten p-norm,

Y . Zhao, Y . Yun, X. Zhang, Q. Li, and Q. Gao, “Multi-view spectral clustering with adaptive graph learning and tensor schatten p-norm,” Neurocomputing, vol. 468, pp. 257–264, 2022

2022
[42]

Umiformer: Mining the correlations between similar tokens for multi-view 3d reconstruction,

Z. Zhu, L. Yang, N. Li, C. Jiang, and Y . Liang, “Umiformer: Mining the correlations between similar tokens for multi-view 3d reconstruction,” inProc. ICCV, 2023, pp. 18 226–18 235

2023
[43]

Vsformer: Mining correlations in flexible view set for multi-view 3d shape under- standing,

H. Sun, Y . Wang, P. Wang, H. Deng, X. Cai, and D. Li, “Vsformer: Mining correlations in flexible view set for multi-view 3d shape under- standing,”IEEE Transactions on Visualization and Computer Graphics, vol. 31, no. 4, pp. 2127–2141, 2024

2024
[44]

Viewclr: Learning self-supervised video representation for unseen viewpoints,

S. Das and M. S. Ryoo, “Viewclr: Learning self-supervised video representation for unseen viewpoints,” inProc. WACV, 2023, pp. 5573– 5583

2023
[45]

Injecting visual features into whisper for parameter-efficient noise- robust audio-visual speech recognition,

Z. Yang, Y . H. Yeo, R. Jiang, X. Fu, W. Chen, W. Xi, and J. Zhao, “Injecting visual features into whisper for parameter-efficient noise- robust audio-visual speech recognition,” inProc. ICASSP, 2025, pp. 1–5

2025
[46]

Audio-visual representa- tion learning via knowledge distillation from speech foundation models,

J.-X. Zhang, G. Wan, J. Gao, and Z.-H. Ling, “Audio-visual representa- tion learning via knowledge distillation from speech foundation models,” Pattern Recognition, vol. 162, p. 111432, 2025

2025
[47]

Improving noise robust audio-visual speech recognition via router-gated cross- modal feature fusion,

D. Lim, Y . Kim, D.-H. Kim, D.-H. Yang, and J.-H. Chang, “Improving noise robust audio-visual speech recognition via router-gated cross- modal feature fusion,” inProc. ASRU, 2025, pp. 1–7

2025
[48]

Multimodal machine learning: A survey and taxonomy,

T. Baltru ˇsaitis, C. Ahuja, and L.-P. Morency, “Multimodal machine learning: A survey and taxonomy,”IEEE transactions on pattern anal- ysis and machine intelligence, vol. 41, no. 2, pp. 423–443, 2018

2018
[49]

Audio-visual speech recognition with a hybrid ctc/attention architec- ture,

S. Petridis, T. Stafylakis, P. Ma, G. Tzimiropoulos, and M. Pantic, “Audio-visual speech recognition with a hybrid ctc/attention architec- ture,” inProc. SLT. IEEE, 2018, pp. 513–520

2018
[50]

Disentan- gled speech embeddings using cross-modal self-supervision,

A. Nagrani, J. S. Chung, S. Albanie, and A. Zisserman, “Disentan- gled speech embeddings using cross-modal self-supervision,” inProc. ICASSP, 2020, pp. 6829–6833

2020
[51]

Learning audio- visual speech representation by masked multimodal cluster prediction,

B. Shi, W.-N. Hsu, K. Lakhotia, and A. Mohamed, “Learning audio- visual speech representation by masked multimodal cluster prediction,” inProc. ICLR, 2022

2022
[52]

Videobert: A joint model for video and language representation learning,

C. Sun, A. Myers, C. V ondrick, K. Murphy, and C. Schmid, “Videobert: A joint model for video and language representation learning,” inProc. ICCV, 2019, pp. 7464–7473. 14

2019
[53]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” inProc. ICML. PmLR, 2021, pp. 8748–8763

2021
[54]

Flamingo: a visual language model for few-shot learning,

J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y . Hasson, K. Lenc, A. Mensch, K. Millicah, M. Reynoldset al., “Flamingo: a visual language model for few-shot learning,” inProc. NeurIPS, 2022, pp. 23 716–23 736

2022
[55]

Out of time: automated lip sync in the wild,

J. S. Chung and A. Zisserman, “Out of time: automated lip sync in the wild,” inProc. ACCV, 2016, pp. 251–263

2016
[56]

End-to-end audio-visual neural speaker diarization,

M.-K. He, J. Du, and C.-H. Lee, “End-to-end audio-visual neural speaker diarization,” inProc. Interspeech, 2022, pp. 1461–1465

2022
[57]

Combining multiple probability predictions using a simple logit model,

V . A. Satop ¨a¨a, J. Baron, D. P. Foster, B. A. Mellers, P. E. Tetlock, and L. H. Ungar, “Combining multiple probability predictions using a simple logit model,”International Journal of Forecasting, vol. 30, no. 2, pp. 344–356, 2014

2014
[58]

On giant’s shoulders: Effortless weak to strong by dynamic logits fusion,

C. Fan, Z. Lu, W. Wei, J. Tian, X. Qu, D. Chen, and Y . Cheng, “On giant’s shoulders: Effortless weak to strong by dynamic logits fusion,”Advances in Neural Information Processing Systems, vol. 37, pp. 29 986–30 014, 2024

2024
[59]

Hy- brid ctc/attention architecture for end-to-end speech recognition,

S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, “Hy- brid ctc/attention architecture for end-to-end speech recognition,”IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, 2017

2017
[60]

A cascade sequence-to-sequence model for chinese mandarin lip reading,

Y . Zhao, R. Xu, and M. Song, “A cascade sequence-to-sequence model for chinese mandarin lip reading,” inProc. ACM Multimedia Asia, 2019, pp. 1–6

2019
[61]

Audio-visual speech recognition in misp2021 challenge: Dataset release and deep analysis,

H. Chen, J. Du, Y . Dai, S. Siniscalchi, S. Watanabe, O. Scharenborg, J. Chen, J. Panet al., “Audio-visual speech recognition in misp2021 challenge: Dataset release and deep analysis,” inProc. Interspeech, vol. 2022, 2022, pp. 1766–1770

2022
[62]

Cn-cvs: A mandarin audio-visual dataset for large vocabulary continuous visual to speech synthesis,

C. Chen, D. Wang, and T. F. Zheng, “Cn-cvs: A mandarin audio-visual dataset for large vocabulary continuous visual to speech synthesis,” in Proc. ICASSP, 2023, pp. 1–5

2023
[63]

Muavic: A multilingual audio-visual corpus for robust speech recogni- tion and robust speech-to-text translation,

M. Anwar, B. Shi, V . Goswami, W.-N. Hsu, J. M. Pino, and C. Wang, “Muavic: A multilingual audio-visual corpus for robust speech recogni- tion and robust speech-to-text translation,”arXiv:2303.00628, 2023

work page arXiv 2023
[64]

Ave speech: A comprehensive multimodal dataset for speech recognition integrating audio, visual, and electromyographic signals,

D. Zhou, Y . Zhang, J. Wu, X. Zhang, L. Xie, and E. Yin, “Ave speech: A comprehensive multimodal dataset for speech recognition integrating audio, visual, and electromyographic signals,”IEEE Transactions on Human-Machine Systems, vol. 55, no. 4, pp. 559–568, 2025

2025
[65]

Sample and computation redistribution for efficient face detection,

J. Guo, J. Deng, A. Lattas, and S. Zafeiriou, “Sample and computation redistribution for efficient face detection,”arXiv:2105.04714, 2021

work page arXiv 2021
[66]

Arcface: Additive angular margin loss for deep face recognition,

J. Deng, J. Guo, N. Xue, and S. Zafeiriou, “Arcface: Additive angular margin loss for deep face recognition,” inProc. CVPR, 2019, pp. 4685– 4694

2019
[67]

The Faiss library

M. Douze, A. Guzhva, C. Deng, J. Johnson, G. Szilvasy, P.-E. Mazar´e, M. Lomeli, L. Hosseini, and H. J ´egou, “The faiss library,” arXiv:2401.08281, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[68]

Billion-scale similarity search with GPUs,

J. Johnson, M. Douze, and H. J ´egou, “Billion-scale similarity search with GPUs,”IEEE Transactions on Big Data, vol. 7, no. 3, pp. 535– 547, 2019

2019
[69]

V oxCeleb2: Deep speaker recognition,

J. S. Chung, A. Nagrani, and A. Zisserman, “V oxCeleb2: Deep speaker recognition,” inProc. Interspeech, 2018, pp. 1086–1090

2018
[70]

The first multimodal information based speech processing (misp) challenge: Data, tasks, baselines and results,

H. Chen, H. Zhou, D. Jun, C.-H. Lee, J. Chen, S. Watanabe, S. M. Siniscalchi, O. Scharenborg, D.-Y . Liu, B.-C. Yinet al., “The first multimodal information based speech processing (misp) challenge: Data, tasks, baselines and results,” inProc. ICASSP, 2022, pp. 9266–9270

2022
[71]

Ouluvs2: A multi-view audiovisual database for non-rigid mouth motion analysis,

I. Anina, Z. Zhouet al., “Ouluvs2: A multi-view audiovisual database for non-rigid mouth motion analysis,” inProc. FG, 2015, pp. 1–5

2015
[72]

Recurrent neural network transducer for audio-visual speech recognition,

T. Makino, H. Liaoet al., “Recurrent neural network transducer for audio-visual speech recognition,” inProc. ASRU, 2019, pp. 905–912

2019
[73]

Auto-avsr: Audio-visual speech recognition with automatic labels,

P. Ma, A. Haliassoset al., “Auto-avsr: Audio-visual speech recognition with automatic labels,” inProc. ICASSP, 2023, pp. 1–5

2023
[74]

MUSAN: A Music, Speech, and Noise Corpus

D. Snyder, G. Chen, and D. Povey, “Musan: A music, speech, and noise corpus,” arXiv preprint arXiv:1510.08484, 2015

[75]

Nara-wpe: A python package for weighted prediction error dereverberation in numpy and tensorflow for online and offline processing,

L. Drude, J. Heymann, C. Boeddeker, and R. Haeb-Umbach, “Nara-wpe: A python package for weighted prediction error dereverberation in numpy and tensorflow for online and offline processing,” in Speech Communication; 13th ITG-Symposium. VDE, 2018, pp. 1–5

2018
[76]

Front-end processing for the chime-5 dinner party scenario,

C. Boeddeker, J. Heitkaemper, J. Schmalenstroeer, L. Drude, J. Heymann, and R. Haeb-Umbach, “Front-end processing for the chime-5 dinner party scenario,” in Proc. CHiME 2018, 2018, pp. 35–40

2018
[77]

Specaugment: A simple data augmentation method for automatic speech recognition,

D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, “Specaugment: A simple data augmentation method for automatic speech recognition,” in Proc. Interspeech, 2019, pp. 2613–2617

2019
[78]

Lipreading using temporal convolutional networks,

B. Martinez, P. Ma, S. Petridis, and M. Pantic, “Lipreading using temporal convolutional networks,” in Proc. ICASSP, 2020, pp. 6319–6323

2020
[79]

Watch or listen: Robust audio-visual speech recognition with visual corruption modeling and reliability scoring,

J. Hong, M. Kim, J. Choi, and Y. M. Ro, “Watch or listen: Robust audio-visual speech recognition with visual corruption modeling and reliability scoring,” in Proc. CVPR, 2023, pp. 18 783–18 794

2023
[80]

Learning video temporal dynamics with cross-modal attention for robust audio-visual speech recognition,

S. Kim, K. Jang, S. Bae, H. Kim, and S.-Y. Yun, “Learning video temporal dynamics with cross-modal attention for robust audio-visual speech recognition,” in Proc. SLT, 2024, pp. 447–454

2024

