M2S-AVSR: Modality-aware Multi-view Self-supervised Representation for Robust Audio-Visual Speech Recognition
Pith reviewed 2026-06-27 23:56 UTC · model grok-4.3
The pith
A modality-aware multi-view self-supervised framework learns view-invariant visual speech features and quality-aware fusion to boost robustness in audio-visual speech recognition under real-world distortions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The M2S-AVSR framework first trains a multi-view representation learning encoder to produce view-invariant visual speech features, then applies a modality-aware module that explicitly estimates modality quality and cross-modal synchrony to control fine-grained visual information injection during decoding. On LRS3 this yields up to 29.4 percent relative improvement under viewpoint perturbation and visual degradation; the method sets a new state-of-the-art on the MISP2021-AVSR test set and records the best result among compared systems in outdoor scenes on the introduced AISHELL8-RealScene dataset.
What carries the argument
The multi-view representation learning encoder paired with the modality-aware fusion module, which together generate view-invariant features and selectively inject visual information according to estimated quality and synchrony.
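The selective-injection idea can be illustrated with a small gating sketch. This is not the paper's module, whose architecture is not given on this page. It assumes a hypothetical scalar quality logit and a hypothetical scalar synchrony logit per frame, and shows only how such estimates could scale the visual contribution before it is added to the audio features.

```python
import math


def sigmoid(x: float) -> float:
    """Map a real-valued logit to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(audio, visual, visual_quality, sync_score):
    """Add visual features to audio features, scaled by a reliability gate.

    visual_quality and sync_score are hypothetical logits (assumptions for
    this sketch): higher means cleaner video and better audio-visual alignment.
    """
    # The gate opens only when BOTH quality and synchrony are high.
    gate = sigmoid(visual_quality) * sigmoid(sync_score)
    return [a + gate * v for a, v in zip(audio, visual)]


# Made-up per-frame feature vectors, for illustration only.
audio = [0.5, -1.0, 2.0]
visual = [1.0, 1.0, 1.0]

clean = gated_fusion(audio, visual, visual_quality=4.0, sync_score=4.0)
occluded = gated_fusion(audio, visual, visual_quality=-4.0, sync_score=4.0)
```

With these made-up inputs, the occluded case stays close to the audio-only features, while the clean case adds nearly the full visual vector.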
If this is right
- On LRS3, the relative word-error-rate reduction reaches up to 29.4 percent when viewpoint and visual quality are perturbed.
- New state-of-the-art performance is reached on the MISP2021-AVSR test set.
- Best reported results among compared systems appear in outdoor scenes of the AISHELL8-RealScene benchmark.
- The framework supplies explicit support for conversational speech recognition in realistic multi-view environments.
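Because the headline figure is a relative number, it helps to recall how such a figure is computed. The sketch below uses the usual definition of relative reduction. The two WER values are hypothetical placeholders, since this page does not report the paper's absolute numbers.

```python
def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Defined as 100 * (baseline - system) / baseline.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


# Hypothetical values chosen for illustration; NOT taken from the paper.
hypothetical_baseline = 20.0
hypothetical_system = 15.0

reduction = relative_wer_reduction(hypothetical_baseline, hypothetical_system)
print(f"{reduction:.1f}% relative reduction")
```

The same relative figure can correspond to very different absolute gains, which is why the absolute baseline matters when reading the 29.4 percent claim.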
Where Pith is reading between the lines
- If the view-invariant encoder generalizes, the same pre-training strategy could be applied to lip-reading systems that must handle arbitrary camera placements.
- Explicit synchrony modeling may reduce errors in remote-conferencing settings where audio and video arrive with variable delay.
- Publication of the multi-scenario dataset may accelerate development of outdoor and multi-view audio-visual benchmarks.
- The quality-aware fusion idea could transfer to other multimodal tasks such as audio-visual event localization under occlusion.
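The variable-delay scenario can be made concrete with a simple offset estimator. This is a generic cross-correlation sketch, not the paper's synchrony model. It assumes each modality has been reduced to a hypothetical one-dimensional activity track sampled at the same frame rate.

```python
def estimate_offset(audio_track, visual_track, max_lag):
    """Return the lag (in frames) that best aligns the two tracks.

    A positive lag means the visual track is delayed relative to the audio.
    The score is the mean product over the overlapping frames.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        total, count = 0.0, 0
        for t, a in enumerate(audio_track):
            u = t + lag
            if 0 <= u < len(visual_track):
                total += a * visual_track[u]
                count += 1
        if count == 0:
            continue  # no overlap at this lag
        score = total / count
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Made-up tracks: the visual track is the audio track delayed by two frames.
audio_track = [0, 0, 1, 3, 1, 0, 0, 0]
visual_track = [0, 0, 0, 0, 1, 3, 1, 0]

offset = estimate_offset(audio_track, visual_track, max_lag=3)
```

An estimated offset like this could feed a synchrony score, although how the paper derives its own synchrony estimate is not described here.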
Load-bearing premise
View-invariant visual features produced by the encoder remain useful when fused with audio even under real-world asynchrony and occlusion.
What would settle it
A new test set containing extreme camera angles and heavy visual occlusion on which the reported word-error-rate reductions disappear or reverse would falsify the central claim.
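Such a test comes down to comparing word error rates for a baseline and the proposed system on the new data. The sketch below computes WER with a standard word-level edit distance. The helper name gain_survives is hypothetical and only encodes the "disappear or reverse" condition.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def gain_survives(baseline_wer: float, system_wer: float) -> bool:
    """Hypothetical check: True only if the system still beats the baseline."""
    return system_wer < baseline_wer


# Made-up transcripts, for illustration only.
wer_example = word_error_rate("a b c d", "a x c")
```

In practice the comparison would be run over a full test set, and a difference near zero would also need a significance test before it is called a reversal.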
Original abstract
Audio-Visual Speech Recognition (AVSR) enhances speech recognition robustness by leveraging visual cues, while real-world scenarios remain challenging due to viewpoint variation, audio distortion, and visual occlusion, which degrade modality quality and increase audio-visual asynchrony. In this paper, we propose a novel Modality-aware Multi-view Self-supervised representation framework for robust Audio-Visual Speech Recognition (M2S-AVSR). First, we introduce a multi-view representation learning encoder to learn view-invariant visual speech representations. Next, we employ a modality-aware module that explicitly models modality quality and cross-modal synchrony to perform fine-grained modality-aware fusion, enabling fine-grained visual information injection during decoding. In addition, we present AISHELL8-RealScene, a public multi-scenario, multi-view conversational audio-visual dataset recorded in real-world environments, and establish a speech recognition benchmark on it. Experiments on English and Mandarin benchmarks demonstrate the effectiveness of the proposed method under challenging conditions. On LRS3, M2S-AVSR achieves up to 29.4% relative improvement under viewpoint perturbation and visual degradation settings. Our method also achieves new state-of-the-art performance on the MISP2021-AVSR test set. On AISHELL8-RealScene, it achieves the best result in outdoor scenes. The proposed method and dataset provide useful support for future research on robust speech and multimodal tasks under realistic conditions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes M2S-AVSR, a modality-aware multi-view self-supervised framework for robust audio-visual speech recognition. It introduces a multi-view representation learning encoder to produce view-invariant visual speech features and a modality-aware fusion module that explicitly models modality quality and cross-modal synchrony for fine-grained fusion during decoding. The authors also release the AISHELL8-RealScene dataset (multi-scenario, multi-view conversational recordings in real-world environments) and report up to 29.4% relative improvement on LRS3 under viewpoint perturbation and visual degradation, new SOTA on the MISP2021-AVSR test set, and best results in outdoor scenes on the new dataset.
Significance. If the experimental results hold under rigorous controls, the work would advance robust AVSR by directly targeting real-world degradations (viewpoint variation, occlusion, asynchrony) via self-supervised view-invariant features and quality-aware fusion. The public release of AISHELL8-RealScene provides a concrete, falsifiable benchmark for future multimodal speech research in realistic conditions.
minor comments (2)
- The abstract reports relative improvements and SOTA claims without absolute WER numbers or explicit baseline comparisons; adding these (e.g., in a results table) would strengthen verifiability of the 29.4% figure.
- The description of the multi-view encoder and modality-aware module is high-level; the main text should include precise architectural diagrams, loss formulations, and training hyperparameters to support reproducibility.
Simulated Author's Rebuttal
We thank the referee for the positive summary, recognition of the significance for robust AVSR under real-world degradations, and recommendation for minor revision. The manuscript presents M2S-AVSR with multi-view self-supervised visual encoding and modality-aware fusion, releases the AISHELL8-RealScene dataset, and reports gains on LRS3, MISP2021, and the new dataset. No major comments were raised in the report.
Circularity Check
No significant circularity
The paper proposes an architectural framework (a multi-view encoder for view-invariant features plus modality-aware fusion) and reports empirical gains on LRS3 perturbations and MISP2021. No derivation chain, equation, or result is shown to reduce by construction to fitted parameters, self-citations, or renamed inputs. Performance numbers are presented as experimental outcomes on external benchmarks and a newly introduced dataset, with no load-bearing self-referential loops or uniqueness theorems invoked from prior author work. The claims therefore rest on external validation rather than on self-referential argument.