arxiv: 2606.05763 · v1 · pith:27BENKZVnew · submitted 2026-06-04 · 📡 eess.AS · cs.SD

M2S-AVSR: Modality-aware Multi-view Self-supervised Representation for Robust Audio-Visual Speech Recognition

Pith reviewed 2026-06-27 23:56 UTC · model grok-4.3

classification 📡 eess.AS cs.SD
keywords audio-visual speech recognition · self-supervised learning · multi-view representation · modality-aware fusion · view-invariant features · robust AVSR · real-scene dataset · LRS3 benchmark

The pith

A modality-aware multi-view self-supervised framework learns view-invariant visual speech features and quality-aware fusion to boost robustness in audio-visual speech recognition under real-world distortions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes M2S-AVSR to handle viewpoint variation, audio distortion, and visual occlusion that degrade audio-visual speech recognition. It trains a multi-view encoder to extract view-invariant visual representations and adds a modality-aware module that scores quality and synchrony to guide fine-grained fusion during decoding. The approach is tested on English and Mandarin benchmarks plus a newly collected multi-scenario dataset, showing large gains precisely when inputs are perturbed or degraded. The central goal is to make visual cues reliable enough to improve recognition outside studio conditions.
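
One way to picture "view-invariant" features is a contrastive objective that pulls together embeddings of the same clip seen from two camera angles and pushes apart embeddings of different clips. The abstract does not state the paper's training objective, so the sketch below is an assumption for illustration only: the InfoNCE-style loss, the function names `cosine` and `info_nce`, the `temperature` value, and the toy 2-D embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(view_a, view_b, temperature=0.1):
    """Mean InfoNCE-style loss (illustrative assumption, not the paper's loss).

    view_a[i] and view_b[i] are embeddings of the same clip from two views;
    every other entry of view_b acts as a negative for view_a[i].
    """
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings: two clips, each seen from a frontal and a profile camera.
frontal = [[1.0, 0.0], [0.0, 1.0]]
profile_aligned = [[0.9, 0.1], [0.1, 0.9]]   # same clip stays close across views
profile_swapped = [[0.1, 0.9], [0.9, 0.1]]   # views disagree about clip identity

aligned_loss = info_nce(frontal, profile_aligned)
swapped_loss = info_nce(frontal, profile_swapped)
print(aligned_loss < swapped_loss)
```

The loss is small when the two views of a clip agree and large when they do not, which is the property a view-invariant encoder would be trained to achieve under this assumed objective.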

Core claim

The M2S-AVSR framework first trains a multi-view representation learning encoder to produce view-invariant visual speech features, then applies a modality-aware module that explicitly estimates modality quality and cross-modal synchrony to control fine-grained visual information injection during decoding. On LRS3 this yields up to 29.4 percent relative improvement under viewpoint perturbation and visual degradation; the method sets a new state-of-the-art on the MISP2021-AVSR test set and records the best result among compared systems in outdoor scenes on the introduced AISHELL8-RealScene dataset.

What carries the argument

The multi-view representation learning encoder paired with the modality-aware fusion module, which together generate view-invariant features and selectively inject visual information according to estimated quality and synchrony.
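
The selective injection described above can be sketched as a gate that scales the visual contribution by estimated quality and synchrony. The abstract gives no architectural details, so everything here is a hypothetical simplification: the function name `gated_fusion`, the product gate, the additive injection, and the scalar scores in [0, 1] are assumptions, not the paper's module.

```python
def gated_fusion(audio_feat, visual_feat, visual_quality, synchrony):
    """Add visual features to audio features, scaled by a scalar gate.

    Illustrative assumption: the gate is the product of a visual-quality
    score and an audio-visual synchrony score, both in [0, 1].
    """
    if not (0.0 <= visual_quality <= 1.0 and 0.0 <= synchrony <= 1.0):
        raise ValueError("scores must lie in [0, 1]")
    if len(audio_feat) != len(visual_feat):
        raise ValueError("feature vectors must have the same length")
    gate = visual_quality * synchrony
    return [a + gate * v for a, v in zip(audio_feat, visual_feat)]


# A clean, well-synchronised frame lets more visual information through
# than an occluded or badly delayed one.
audio = [0.5, -1.0]
visual = [2.0, 4.0]
print(gated_fusion(audio, visual, visual_quality=0.5, synchrony=0.5))
print(gated_fusion(audio, visual, visual_quality=0.0, synchrony=1.0))
```

With a gate of zero the output falls back to the audio features alone, which is the behaviour one would want when the lips are fully occluded.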

If this is right

  • Up to 29.4 percent relative word-error-rate reduction occurs on LRS3 when viewpoint and visual quality are perturbed.
  • New state-of-the-art performance is reached on the MISP2021-AVSR test set.
  • Best reported results among compared systems appear in outdoor scenes of the AISHELL8-RealScene benchmark.
  • The framework supplies explicit support for conversational speech recognition in realistic multi-view environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the view-invariant encoder generalizes, the same pre-training strategy could be applied to lip-reading systems that must handle arbitrary camera placements.
  • Explicit synchrony modeling may reduce errors in remote-conferencing settings where audio and video arrive with variable delay.
  • Publication of the multi-scenario dataset may accelerate development of outdoor and multi-view audio-visual benchmarks.
  • The quality-aware fusion idea could transfer to other multimodal tasks such as audio-visual event localization under occlusion.
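
To make the variable-delay scenario concrete, a simple baseline for measuring audio-visual offset is to slide one activity envelope against the other and keep the lag with the highest average overlap. This is a generic illustration, not the paper's synchrony model; the function name `estimate_lag`, the per-frame envelopes, and the mean-product score are assumptions.

```python
def estimate_lag(audio_env, visual_env, max_lag):
    """Return the frame lag of visual_env relative to audio_env.

    A positive result means the visual stream trails the audio stream.
    Illustrative assumption: the score is the mean product of overlapping frames.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        pairs = [
            (audio_env[t], visual_env[t + lag])
            for t in range(len(audio_env))
            if 0 <= t + lag < len(visual_env)
        ]
        if not pairs:
            continue  # no overlap at this lag
        score = sum(a * v for a, v in pairs) / len(pairs)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Toy envelopes: the mouth-motion peak arrives two frames after the audio peak.
audio_activity = [0, 0, 1, 0, 0, 0, 0, 0]
mouth_activity = [0, 0, 0, 0, 1, 0, 0, 0]
print(estimate_lag(audio_activity, mouth_activity, max_lag=3))
```

A score like this could serve as a rough input to a synchrony-aware gate, though a learned model would be needed for real conversational data.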

Load-bearing premise

View-invariant visual features produced by the encoder remain useful when fused with audio even under real-world asynchrony and occlusion.

What would settle it

A new test set containing extreme camera angles and heavy visual occlusion on which the reported word-error-rate reductions disappear or reverse would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.05763 by Cancan Li, Fei Su, Juan Liu, Ming Li.

Figure 1: Illustration of AVSR performance degradation under challenging …
Figure 2: Overview of the proposed M2S-AVSR framework.
Figure 3: Detailed structure of the modality-aware module. The upper branch …
Figure 4: Overview of the AISHELL8-RealScene recording scenes. (a) Schematic illustration of the recording scenes, where …
Figure 5: Distribution analysis of AISHELL8-RealScene dataset.
Figure 6: Occlusion-based lip sensitivity visualization for self-supervised multi …
Figure 7: Visualization of modality-aware fusion on the MISP2021 dataset. Test sample: R16 …
Figure 8: Evaluation on multi-view data (VSR WER (%)). “30h” and “433h” …
Original abstract

Audio-Visual Speech Recognition (AVSR) enhances speech recognition robustness by leveraging visual cues, while real-world scenarios remain challenging due to viewpoint variation, audio distortion, and visual occlusion, which degrade modality quality and increase audio-visual asynchrony. In this paper, we propose a novel Modality-aware Multi-view Self-supervised representation framework for robust Audio-Visual Speech Recognition (M2S-AVSR). First, we introduce a multi-view representation learning encoder to learn view-invariant visual speech representations. Next, we employ a modality-aware module that explicitly models modality quality and cross-modal synchrony to perform fine-grained modality-aware fusion, enabling fine-grained visual information injection during decoding. In addition, we present AISHELL8-RealScene, a public multi-scenario, multi-view conversational audio-visual dataset recorded in real-world environments, and establish a speech recognition benchmark on it. Experiments on English and Mandarin benchmarks demonstrate the effectiveness of the proposed method under challenging conditions. On LRS3, M2S-AVSR achieves up to 29.4% relative improvement under viewpoint perturbation and visual degradation settings. Our method also achieves new state-of-the-art performance on the MISP2021-AVSR test set. On AISHELL8-RealScene, it achieves the best result in outdoor scenes. The proposed method and dataset provide useful support for future research on robust speech and multimodal tasks under realistic conditions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript proposes M2S-AVSR, a modality-aware multi-view self-supervised framework for robust audio-visual speech recognition. It introduces a multi-view representation learning encoder to produce view-invariant visual speech features and a modality-aware fusion module that explicitly models modality quality and cross-modal synchrony for fine-grained fusion during decoding. The authors also release the AISHELL8-RealScene dataset (multi-scenario, multi-view conversational recordings in real-world environments) and report up to 29.4% relative improvement on LRS3 under viewpoint perturbation and visual degradation, new SOTA on the MISP2021-AVSR test set, and best results in outdoor scenes on the new dataset.

Significance. If the experimental results hold under rigorous controls, the work would advance robust AVSR by directly targeting real-world degradations (viewpoint variation, occlusion, asynchrony) via self-supervised view-invariant features and quality-aware fusion. The public release of AISHELL8-RealScene provides a concrete, falsifiable benchmark for future multimodal speech research in realistic conditions.

minor comments (2)
  1. The abstract reports relative improvements and SOTA claims without absolute WER numbers or explicit baseline comparisons; adding these (e.g., in a results table) would strengthen verifiability of the 29.4% figure.
  2. The description of the multi-view encoder and modality-aware module is high-level; the main text should include precise architectural diagrams, loss formulations, and training hyperparameters to support reproducibility.
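
The first comment turns on the difference between relative and absolute error reduction. A short worked example shows why a relative figure alone is hard to verify: the same relative gain can correspond to very different absolute changes. The WER values below are made-up placeholders, not numbers from the paper, and the function name `relative_wer_reduction` is hypothetical.

```python
def relative_wer_reduction(baseline_wer, system_wer):
    """Relative WER reduction in percent: 100 * (baseline - system) / baseline.

    A negative result means the system is worse than the baseline.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


# Hypothetical WER pairs (percent): same relative gain, different absolute gain.
hard_condition = relative_wer_reduction(10.0, 8.0)   # 2.0 points absolute
easy_condition = relative_wer_reduction(2.0, 1.6)    # 0.4 points absolute
regression = relative_wer_reduction(10.0, 12.0)      # system got worse

print(round(hard_condition, 1), round(easy_condition, 1), round(regression, 1))
```

Reporting the baseline and system WER side by side lets a reader recompute the relative figure and judge how large the absolute change is.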

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, recognition of the significance for robust AVSR under real-world degradations, and recommendation for minor revision. The manuscript presents M2S-AVSR with multi-view self-supervised visual encoding and modality-aware fusion, releases the AISHELL8-RealScene dataset, and reports gains on LRS3, MISP2021, and the new dataset. No major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity

The paper proposes an architectural framework (a multi-view encoder for view-invariant features plus modality-aware fusion) and reports empirical gains on LRS3 perturbations and MISP2021. No derivation chain, equation, or result is shown to reduce by construction to fitted parameters, self-citations, or renamed inputs. Performance numbers are presented as experimental outcomes on external benchmarks and a newly introduced dataset, with no load-bearing self-referential loops or uniqueness theorems invoked from prior author work. The evaluation is self-contained and validated against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations, training objectives, or implementation details; therefore no free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.1-grok · 5794 in / 1159 out tokens · 18238 ms · 2026-06-27T23:56:51.897262+00:00 · methodology

