arxiv: 2605.26672 · v2 · pith:F4EMRQLDnew · submitted 2026-05-26 · 💻 cs.MM · cs.SD

Can We Hear from Events? Generating Speech from Event Camera

Pith reviewed 2026-06-29 14:37 UTC · model grok-4.3

classification 💻 cs.MM cs.SD
keywords event camera · speech generation · neuromorphic vision · multimodal generation · emotional speech · temporal alignment · bidirectional mechanism · wavelet contextualizer

The pith

Event cameras generate expressive speech by aligning microsecond visual events with acoustic waveforms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that fixed-exposure RGB cameras create a temporal mismatch that blurs the fast mouth and face movements needed for emotional speech. Neuromorphic event cameras record only changes at microsecond scale, which the authors say matches the timing of sound production. They introduce EventSpeech, a text-conditioned system with an event encoder, a wavelet-based audio encoder, and a bidirectional alignment step that links visual events to dense audio features. A new dataset called EVT-SPK supplies both synthetic and real neuromorphic recordings for training and testing. Experiments show the approach keeps finer emotional cues and avoids blur compared with RGB baselines.
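
The timing argument can be illustrated with a small sketch that places the same burst of microsecond-stamped events on two time grids. This is not the paper's preprocessing: the 10 ms audio hop, the 25 fps RGB rate, the synthetic burst, and the helper name `bin_events` are all assumptions chosen for illustration.

```python
import numpy as np

def bin_events(timestamps_us, bin_us, duration_us):
    """Count events per fixed-width time bin (all times in microseconds)."""
    n_bins = int(np.ceil(duration_us / bin_us))
    idx = np.minimum(timestamps_us // bin_us, n_bins - 1)
    return np.bincount(idx, minlength=n_bins)

# Hypothetical burst: 200 events during a 5 ms lip closure starting at 52 ms.
rng = np.random.default_rng(0)
burst = rng.integers(52_000, 57_000, size=200)
duration_us = 120_000  # 120 ms clip

audio_grid = bin_events(burst, 10_000, duration_us)  # 10 ms hop -> 12 bins
rgb_grid = bin_events(burst, 40_000, duration_us)    # 25 fps -> 3 bins

print(audio_grid)  # all 200 events land in bin 5 (50-60 ms)
print(rgb_grid)    # all 200 events land in bin 1 (40-80 ms)
```

On the finer grid the burst is localized to a single 10 ms window; on the frame grid it can only be placed somewhere inside a 40 ms window. The sketch shows timestamp resolution only and does not model exposure blur.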

Core claim

EventSpeech is a text-conditioned framework that pioneers neuromorphic events for speech generation. It pairs a dedicated Event Encoder for sparse event data with a multi-scale Audio Encoder containing a Hierarchical Wavelet Contextualizer, then uses a bidirectional alignment mechanism to synchronize linguistic content and visual dynamics with dense acoustic features. The authors state that microsecond-precise events naturally align with acoustic waveform dynamics, overcoming the temporal granularity mismatch of RGB cameras.
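
The text available here names the Hierarchical Wavelet Contextualizer but does not describe its internals. As a rough picture of what multi-scale wavelet features can look like, the sketch below runs a plain orthonormal Haar decomposition. The Haar basis, the number of levels, and the function name `haar_decompose` are assumptions, not the authors' design.

```python
import numpy as np

def haar_decompose(signal, levels):
    """Return ([detail_1, ..., detail_L], approximation) for an orthonormal Haar transform."""
    approx = np.asarray(signal, dtype=float)
    details = []
    for _ in range(levels):
        if len(approx) % 2:  # pad odd lengths by repeating the last sample
            approx = np.append(approx, approx[-1])
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))  # high-frequency band
        approx = (even + odd) / np.sqrt(2.0)         # low-frequency band
    return details, approx

# Toy input: a sine with a 16-sample period, 64 samples long.
x = np.sin(2 * np.pi * np.arange(64) / 16.0)
details, approx = haar_decompose(x, levels=3)
print([len(d) for d in details], len(approx))  # [32, 16, 8] 8
```

Each level halves the temporal resolution, so the detail bands separate fast transients from slower contours. How HWC actually combines such bands is not stated in the material above.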

What carries the argument

The bidirectional alignment mechanism that synchronizes linguistic content and visual dynamics with dense acoustic features.

If this is right

  • Generated speech preserves fine-grained emotional detail that RGB methods lose to blur.
  • The system resists motion blur on high-frequency facial movements during speech.
  • EVT-SPK supplies the first large-scale benchmark mixing synthetic and real neuromorphic recordings.
  • The architecture establishes a new multimodal route for speech generation that does not rely on fixed-frame RGB input.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same event-to-audio alignment could be tested on live streams where lighting changes rapidly.
  • Combining event input with standard microphones might allow hybrid systems that fall back when one sensor fails.
  • The sparse nature of events could reduce compute load for on-device speech synthesis compared with dense video frames.

Load-bearing premise

Microsecond-precise event data from neuromorphic cameras can be reliably aligned with acoustic waveform dynamics via the bidirectional mechanism without introducing artifacts or losing linguistic content.

What would settle it

A controlled comparison in which speech generated from event data shows the same loss of emotional transients or the same motion-blur artifacts as RGB-based methods under rapid articulatory motion would disprove the claimed advantage.

Figures

Figures reproduced from arXiv: 2605.26672 by Chenyang Xu, Jingping Fang, Lin Chen, Tong Zhao, Weidong Cai, Xiaoming Chen.

Figure 1: Pipeline of EventSpeech. During training (left), the framework learns highly natural speech representations guided by audio supervision and neuromorphic events, which capture high-frequency articulatory dynamics often missed by standard RGB frames. During inference (right), strict multimodal consistency accommodates both text-only generation via learned alignment priors and vision-augmented synthesis when …
Figure 2: EventSpeech Core Architecture. Left: Event Encoder extracts disentangled kinematics from V2E events via MHFE. Right: Audio Encoder fuses Mamba and HWC blocks for hierarchical acoustic modeling. Middle: Following InfoNCE cross-modal alignment, a VITS dual-stream framework routes latent representations via a Knowledge Bridge to a Flow Matching decoder for generation. (Inset: Spatial CNN architecture.)
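
The Figure 2 caption mentions InfoNCE cross-modal alignment without giving the loss. A minimal sketch of a symmetric InfoNCE objective over paired visual and acoustic embeddings follows. The symmetric form, the temperature of 0.07, and the function name `info_nce` are assumptions, since the paper's exact formulation is not shown here.

```python
import numpy as np

def info_nce(visual, audio, temperature=0.07):
    """Symmetric InfoNCE over a batch; row i of each input is a matched pair."""
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    logits = v @ a.T / temperature  # (B, B) scaled cosine similarities

    def cross_entropy_diag(z):
        z = z - z.max(axis=1, keepdims=True)  # numerical stability
        log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))    # positives sit on the diagonal

    # Average the visual-to-audio and audio-to-visual directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

# Hypothetical embeddings: 8 pairs, 16 dimensions.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
print(info_nce(emb, emb + 0.01 * rng.normal(size=emb.shape)))  # matched pairs
print(info_nce(emb, np.roll(emb, 1, axis=0)))                  # mismatched pairs
```

The loss is low when each visual embedding is closest to its own audio embedding and high when the pairing is shuffled, which is the behavior a contrastive alignment stage relies on.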
Figure 3: Architecture of Hierarchical Cross-Modal Alignment. Visual and acoustic features are projected and temporally aligned, then processed by bidirectional cross-attention (V→A and A→V). The concatenated outputs are fused via MLP to produce F_align.
… serve as reciprocal queries, this mechanism captures the intricate interplay between lip kinematics and phonetic content. The resulting bidirectional contexts are i…
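
The Figure 3 caption describes the alignment as projection, temporal alignment, two cross-attention passes, concatenation, and an MLP. The sketch below follows that order under several simplifying assumptions that do not come from the paper: single-head attention with no separate key and value projections, linear interpolation for temporal alignment, a two-layer ReLU MLP, and random weights. All function and weight names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    """Single-head scaled dot-product attention: queries attend over context."""
    scores = queries @ context.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ context

def resample_time(features, target_len):
    """Linearly interpolate a (T, d) sequence to target_len time steps."""
    src = np.linspace(0.0, 1.0, features.shape[0])
    dst = np.linspace(0.0, 1.0, target_len)
    return np.stack([np.interp(dst, src, features[:, j])
                     for j in range(features.shape[1])], axis=1)

def bidirectional_align(visual, audio, w_v, w_a, w1, w2):
    v = resample_time(visual @ w_v, audio.shape[0])  # project, then match audio length
    a = audio @ w_a                                  # project acoustic features
    v2a = cross_attention(v, a)  # visual queries over acoustic context
    a2v = cross_attention(a, v)  # acoustic queries over visual context
    fused = np.concatenate([v2a, a2v], axis=-1)      # (T, 2d)
    return np.maximum(fused @ w1, 0.0) @ w2          # MLP output, (T, d)

# Hypothetical shapes: 20 visual steps, 50 acoustic frames, shared width 8.
rng = np.random.default_rng(0)
d = 8
f_align = bidirectional_align(
    rng.normal(size=(20, 12)), rng.normal(size=(50, 24)),
    rng.normal(size=(12, d)), rng.normal(size=(24, d)),
    rng.normal(size=(2 * d, 32)), rng.normal(size=(32, d)),
)
print(f_align.shape)  # (50, 8)
```

Aligning both streams to one length before attention is what lets the two directional outputs be concatenated frame by frame.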
Figure 5: Statistics of EVT-SPK. EVT-SPK-Synth (left) comprises 36K clips (38 hours) synthesized via V2E. EVT-SPK-Real (right) contains 2.8K clips (4 hours) natively captured using a DAVIS346 camera. Both subsets guarantee HD visual quality alongside explicit emotion annotations.
V. EXPERIMENT AND RESULTS. A. Setup. Implementation Details. We train the 113M-parameter EventSpeech on most of the EVT-SPK-Synth dataset us…
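
The clip counts and durations in the Figure 5 caption imply approximate average clip lengths. The arithmetic below uses only the caption's rounded figures (36K clips in 38 hours, 2.8K clips in 4 hours), so the results are estimates; the helper name is hypothetical.

```python
def mean_clip_seconds(total_hours, num_clips):
    """Average clip length in seconds from a total duration and a clip count."""
    return total_hours * 3600.0 / num_clips

synth = mean_clip_seconds(38, 36_000)  # 3.8 s per synthetic clip
real = mean_clip_seconds(4, 2_800)     # about 5.14 s per real clip
real_share = 4 / (38 + 4)              # real data is about 9.5% of total hours

print(round(synth, 2), round(real, 2), round(real_share, 3))  # 3.8 5.14 0.095
```

On these numbers the real subset is roughly a tenth of the total recording time.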
Figure 6: Qualitative comparison of generated Mel-spectrograms between our method and competitive baselines. Highlighted regions mark where EventSpeech exhibits superior capability in synthesizing fine-grained emotional fluctuations, natural prosodic transitions, and temporal consistency aligned with the GT.
… specialized video dubbing frameworks like VoiceCraft-Dub [28] and StyleDubber [9], and conventional RGB-based vi…
Original abstract

Traditional RGB-based speech generation faces Temporal Granularity Mismatch since fixed camera exposure times inevitably blur the high-frequency articulatory transients essential for rendering emotional speech. To break this ceiling, we propose EventSpeech as a novel text-conditioned framework pioneering the use of neuromorphic events for expressive speech generation, since these microsecond-precise events naturally align with acoustic waveform dynamics. Our architecture integrates a dedicated Event Encoder to model sparse neuromorphic events alongside a multi-scale Audio Encoder featuring a Hierarchical Wavelet Contextualizer (HWC). A bidirectional alignment mechanism seamlessly synchronizes linguistic content and visual dynamics with dense acoustic features. Furthermore, we construct EVT-SPK as the first benchmark comprising large-scale synthetic data and real-world recordings from specialized neuromorphic hardware. Extensive evaluations demonstrate that EventSpeech significantly outperforms current baselines by preserving fine-grained emotions and resisting motion blur to establish a new paradigm for multimodal speech generation. Code and demo are available at https://xrfang-0102.github.io/EventSpeechWeb/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes EventSpeech, a text-conditioned framework for expressive speech generation from neuromorphic event camera data. It addresses temporal granularity mismatch in RGB cameras by using microsecond-precise events that align with acoustic dynamics. The architecture includes an Event Encoder for sparse events, a multi-scale Audio Encoder with Hierarchical Wavelet Contextualizer (HWC), and a bidirectional alignment mechanism to synchronize linguistic content with dense acoustic features. A new benchmark EVT-SPK is introduced with synthetic and real-world neuromorphic recordings. The paper claims extensive evaluations show significant outperformance over baselines in emotion preservation and motion blur resistance, establishing a new paradigm for multimodal speech generation.

Significance. If the central claims hold with supporting evidence, this could open a new direction in multimodal speech synthesis by exploiting event cameras' high temporal resolution for articulatory transients. The EVT-SPK benchmark construction is a clear positive contribution that enables future work. The offer of code and demo supports reproducibility.

major comments (2)
  1. [Bidirectional alignment mechanism] The bidirectional alignment mechanism (described in the abstract and method overview) is load-bearing for the headline claim of preserving phonemic/emotional content without artifacts, yet no equation, loss term, or invertibility argument is supplied to show that cross-attention on EVT-SPK data avoids discarding high-frequency formant information or introducing linguistic loss.
  2. [Evaluation / Experiments] The abstract asserts that 'extensive evaluations demonstrate' significant outperformance, but the provided text contains no quantitative metrics, error bars, ablation tables, or baseline comparisons; without these, the central claim of establishing a new paradigm cannot be assessed.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., emotion preservation metric or WER) to ground the outperformance claim.
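
For readers unfamiliar with the WER metric suggested in the minor comment, a minimal sketch of the standard word-level edit-distance definition follows. Whitespace tokenization and the function name are assumptions, and nothing here reflects how the paper computes its metrics.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # edit distances against an empty reference
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / max(len(ref), 1)

print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # one deletion over six words
```

In practice the hypothesis would come from running a speech recognizer on the generated audio.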

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point-by-point below and will revise the paper to strengthen the technical presentation and clarity of results.

Point-by-point responses
  1. Referee: [Bidirectional alignment mechanism] The bidirectional alignment mechanism (described in the abstract and method overview) is load-bearing for the headline claim of preserving phonemic/emotional content without artifacts, yet no equation, loss term, or invertibility argument is supplied to show that cross-attention on EVT-SPK data avoids discarding high-frequency formant information or introducing linguistic loss.

    Authors: We agree that a formal description of the bidirectional alignment is necessary to substantiate the claims. In the revised manuscript we will add the explicit cross-attention equations, the composite loss (alignment loss plus reconstruction and perceptual terms), and a short information-preservation argument that leverages the microsecond temporal resolution of events together with the multi-scale wavelet features in HWC. This addition will directly address concerns about high-frequency formant retention. revision: yes

  2. Referee: [Evaluation / Experiments] The abstract asserts that 'extensive evaluations demonstrate' significant outperformance, but the provided text contains no quantitative metrics, error bars, ablation tables, or baseline comparisons; without these, the central claim of establishing a new paradigm cannot be assessed.

    Authors: The full manuscript contains Section V (Experiment and Results) with quantitative tables (MOS, emotion classification accuracy, F0 RMSE, etc.), error bars from five random seeds, ablation studies on the alignment module and HWC, and direct comparisons against RGB baselines. We will ensure these results are explicitly referenced from the abstract and method overview in the revision and will add a short summary table in the main text if the review excerpt omitted the experimental section. revision: partial
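
Of the metrics the simulated rebuttal lists, F0 RMSE is the easiest to pin down in code. The sketch below uses one common convention: unvoiced frames are stored as 0 Hz, and the error is taken only over frames voiced in both contours. That convention and the function name are assumptions, not details from the paper.

```python
import numpy as np

def f0_rmse(f0_reference, f0_generated):
    """RMSE in Hz over frames where both pitch contours are voiced (> 0)."""
    ref = np.asarray(f0_reference, dtype=float)
    gen = np.asarray(f0_generated, dtype=float)
    voiced = (ref > 0) & (gen > 0)
    if not voiced.any():
        return float("nan")  # no overlapping voiced frames to compare
    return float(np.sqrt(np.mean((ref[voiced] - gen[voiced]) ** 2)))

# Frames 0 and 3 are voiced in both contours, with errors of -10 Hz and +10 Hz.
print(f0_rmse([100, 0, 200, 220], [110, 120, 0, 210]))  # 10.0
```

The two contours must already be time-aligned frame by frame for the comparison to be meaningful.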

Circularity Check

0 steps flagged

No circularity: architecture and benchmark described without self-referential derivations

full rationale

The abstract and provided text introduce EventSpeech with an Event Encoder, Hierarchical Wavelet Contextualizer, and bidirectional alignment mechanism, plus the EVT-SPK benchmark. No equations, loss terms, or derivation steps are shown that reduce any prediction to fitted inputs by construction. No self-citations appear, and claims rest on empirical outperformance rather than definitional equivalence or imported uniqueness theorems. The alignment is asserted to synchronize content but is not exhibited as reducing to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no equations, training details, or dataset statistics, so the ledger cannot list concrete free parameters, axioms, or invented entities.

Reference graph

Works this paper leans on

62 extracted references · 8 canonical work pages · 5 internal anchors

  1. [1]

    Matcha-tts: A fast tts architecture with conditional flow matching,

    S. Mehta, R. Tu, J. Beskow, É. Székely, and G. E. Henter, “Matcha-tts: A fast tts architecture with conditional flow matching,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 11341–11345, IEEE, 2024

  2. [2]

    Naturalspeech: End-to-end text-to-speech synthesis with human-level quality,

    X. Tan, J. Chen, H. Liu, J. Cong, C. Zhang, Y. Liu, X. Wang, Y. Leng, Y. Yi, L. He, et al., “Naturalspeech: End-to-end text-to-speech synthesis with human-level quality,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 6, pp. 4234–4245, 2024

  3. [3]

    Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech,

    J. Kim, J. Kong, and J. Son, “Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech,” in International conference on machine learning, pp. 5530–5540, PMLR, 2021

  4. [4]

    Imaginary voice: Face-styled diffusion model for text-to-speech,

    J. Lee, J. S. Chung, and S.-W. Chung, “Imaginary voice: Face-styled diffusion model for text-to-speech,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5, IEEE, 2023

  5. [5]

    V2a-mapper: A lightweight solution for vision-to-audio generation by connecting foundation models,

    H. Wang, J. Ma, S. Pascual, R. Cartwright, and W. Cai, “V2a-mapper: A lightweight solution for vision-to-audio generation by connecting foundation models,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, pp. 15492–15501, 2024

  6. [6]

    Vall-e 2: Neural codec language models are human parity zero-shot text to speech synthesizers. arXiv preprint arXiv:2406.05370, 2024

    S. Chen, S. Liu, L. Zhou, Y. Liu, X. Tan, J. Li, S. Zhao, Y. Qian, and F. Wei, “Vall-e 2: Neural codec language models are human parity zero-shot text to speech synthesizers,” arXiv preprint arXiv:2406.05370, 2024

  7. [7]

    Glow-tts: A generative flow for text-to-speech via monotonic alignment search,

    J. Kim, S. Kim, J. Kong, and S. Yoon, “Glow-tts: A generative flow for text-to-speech via monotonic alignment search,” Advances in Neural Information Processing Systems, vol. 33, pp. 8067–8077, 2020

  8. [8]

    Dcptalk: Speech-driven 3d face animation with personalized facial dynamic coupling properties,

    Z. Chu, K. Guo, X. Xing, P. Liu, B. Cai, and X. Xu, “Dcptalk: Speech-driven 3d face animation with personalized facial dynamic coupling properties,” IEEE Transactions on Multimedia, vol. 27, pp. 4427–4440, 2025.

    IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 14, NO. 8, MAY 2026

  9. [9]

    Styledubber: Towards multi-scale style learning for movie dubbing,

    G. Cong, Y. Qi, L. Li, A. Beheshti, Z. Zhang, A. Hengel, M.-H. Yang, C. Yan, and Q. Huang, “Styledubber: Towards multi-scale style learning for movie dubbing,” in Findings of the Association for Computational Linguistics: ACL 2024, pp. 6767–6779, 2024

  10. [10]

    Seeing and hearing: Open-domain visual-audio generation with diffusion latent aligners,

    Y. Xing, Y. He, Z. Tian, X. Wang, and Q. Chen, “Seeing and hearing: Open-domain visual-audio generation with diffusion latent aligners,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7151–7161, 2024

  11. [11]

    Tell what you hear from what you see-video to audio generation through text,

    X. Liu, K. Su, and E. Shlizerman, “Tell what you hear from what you see-video to audio generation through text,” Advances in Neural Information Processing Systems, vol. 37, pp. 101337–101366, 2024

  12. [12]

    Learning dual modality interactions for event-based motion deblurring,

    Z. Xiao, Z. Li, Y. Zhao, Y. Liu, Z. Zhang, and W. Jia, “Learning dual modality interactions for event-based motion deblurring,” IEEE Transactions on Multimedia, 2026

  13. [13]

    v2e: From video frames to realistic dvs events,

    Y. Hu, S.-C. Liu, and T. Delbruck, “v2e: From video frames to realistic dvs events,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 1312–1321, 2021

  14. [14]

    Recent event camera innovations: A survey,

    B. Chakravarthi, A. A. Verma, K. Daniilidis, C. Fermuller, and Y. Yang, “Recent event camera innovations: A survey,” in European conference on computer vision, pp. 342–376, Springer, 2024

  15. [15]

    Eventlip: Enhancing event-based lip reading via frequency-aware spatiotemporal hypergraph modeling,

    X. Zhang, J. Sun, C. Zhang, X. Yue, T. Xiao, S. Cai, M. Lao, and H. Li, “Eventlip: Enhancing event-based lip reading via frequency-aware spatiotemporal hypergraph modeling,” in Proceedings of the 33rd ACM International Conference on Multimedia, pp. 8263–8272, 2025

  16. [16]

    Event-based low-illumination image enhancement,

    Y. Jiang, Y. Wang, S. Li, Y. Zhang, M. Zhao, and Y. Gao, “Event-based low-illumination image enhancement,” IEEE Transactions on Multimedia, vol. 26, pp. 1920–1931, 2023

  17. [17]

    E-mlb: Multilevel benchmark for event-based camera denoising,

    S. Ding, J. Chen, Y. Wang, Y. Kang, W. Song, J. Cheng, and Y. Cao, “E-mlb: Multilevel benchmark for event-based camera denoising,” IEEE Transactions on Multimedia, vol. 26, pp. 65–76, 2023

  18. [18]

    Mead: A large-scale audio-visual dataset for emotional talking-face generation,

    K. Wang, Q. Wu, L. Song, Z. Yang, W. Wu, C. Qian, R. He, Y. Qiao, and C. C. Loy, “Mead: A large-scale audio-visual dataset for emotional talking-face generation,” in European conference on computer vision, pp. 700–717, Springer, 2020

  19. [19]

    The ryerson audio-visual database of emotional speech and song (ravdess): A dynamic, multimodal set of facial and vocal expressions in north american english,

    S. R. Livingstone and F. A. Russo, “The ryerson audio-visual database of emotional speech and song (ravdess): A dynamic, multimodal set of facial and vocal expressions in north american english,” PloS one, vol. 13, no. 5, p. e0196391, 2018

  20. [20]

    Tacotron: Towards End-to-End Speech Synthesis

    Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, et al., “Tacotron: Towards end-to-end speech synthesis,” arXiv preprint arXiv:1703.10135, 2017

  21. [21]

    Natural tts synthesis by conditioning wavenet on mel spectrogram predictions,

    J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, et al., “Natural tts synthesis by conditioning wavenet on mel spectrogram predictions,” in 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4779–4783, IEEE, 2018

  22. [22]

    Grad-tts: A diffusion probabilistic model for text-to-speech,

    V. Popov, I. Vovk, V. Gogoryan, T. Sadekova, and M. Kudinov, “Grad-tts: A diffusion probabilistic model for text-to-speech,” in International conference on machine learning, pp. 8599–8608, PMLR, 2021

  23. [23]

    DiffWave: A Versatile Diffusion Model for Audio Synthesis

    Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro, “Diffwave: A versatile diffusion model for audio synthesis,” arXiv preprint arXiv:2009.09761, 2020

  24. [24]

    F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching,

    Y. Chen, Z. Niu, Z. Ma, K. Deng, C. Wang, J. JianZhao, K. Yu, and X. Chen, “F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 6255–6271, 2025

  25. [25]

    Ditto-tts: Diffusion transformers for scalable text-to-speech without domain-specific factors,

    K. Lee, D. W. Kim, J. Kim, S. Chung, and J. Cho, “Ditto-tts: Diffusion transformers for scalable text-to-speech without domain-specific factors,” in International Conference on Learning Representations, vol. 2025, pp. 52022–52055, 2025

  26. [26]

    More than words: In-the-wild visually-driven prosody for text-to-speech,

    M. Hassid, M. T. Ramanovich, B. Shillingford, M. Wang, Y. Jia, and T. Remez, “More than words: In-the-wild visually-driven prosody for text-to-speech,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10587–10597, 2022

  27. [27]

    Realistic speech-driven facial animation with gans,

    K. Vougioukas, S. Petridis, and M. Pantic, “Realistic speech-driven facial animation with gans,” International Journal of Computer Vision, vol. 128, no. 5, pp. 1398–1413, 2020

  28. [28]

    Voicecraft-dub: Automated video dubbing with neural codec language models,

    K. Sung-Bin, J. Choi, P. Peng, J. S. Chung, T.-H. Oh, and D. Harwath, “Voicecraft-dub: Automated video dubbing with neural codec language models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14623–14632, 2025

  29. [29]

    Learning to dub movies via hierarchical prosody models,

    G. Cong, L. Li, Y. Qi, Z.-J. Zha, Q. Wu, W. Wang, B. Jiang, M.-H. Yang, and Q. Huang, “Learning to dub movies via hierarchical prosody models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14687–14697, 2023

  30. [30]

    Foleycrafter: Bring silent videos to life with lifelike and synchronized sounds,

    Y. Zhang, Y. Gu, Y. Zeng, Z. Xing, Y. Wang, Z. Wu, B. Liu, and K. Chen, “Foleycrafter: Bring silent videos to life with lifelike and synchronized sounds,” International Journal of Computer Vision, vol. 134, no. 1, p. 46, 2026

  31. [31]

    Diff-foley: Synchronized video-to-audio synthesis with latent diffusion models,

    S. Luo, C. Yan, C. Hu, and H. Zhao, “Diff-foley: Synchronized video-to-audio synthesis with latent diffusion models,” Advances in Neural Information Processing Systems, vol. 36, pp. 48855–48876, 2023

  32. [32]

    Taro: Timestep-adaptive representation alignment with onset-aware conditioning for synchronized video-to-audio synthesis,

    T. Ton, J. W. Hong, and C. D. Yoo, “Taro: Timestep-adaptive representation alignment with onset-aware conditioning for synchronized video-to-audio synthesis,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14228–14237, 2025

  33. [33]

    Audio-visual controlled video diffusion with masked selective state spaces modeling for natural talking head generation,

    F.-T. Hong, Z. Xu, Z. Zhou, J. Zhou, X. Li, Q. Lin, Q. Lu, and D. Xu, “Audio-visual controlled video diffusion with masked selective state spaces modeling for natural talking head generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12549–12558, 2025

  34. [34]

    Float: Generative motion latent flow matching for audio-driven talking portrait,

    T. Ki, D. Min, and G. Chae, “Float: Generative motion latent flow matching for audio-driven talking portrait,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14699–14710, 2025

  35. [35]

    High speed and high dynamic range video with an event camera,

    H. Rebecq, R. Ranftl, V. Koltun, and D. Scaramuzza, “High speed and high dynamic range video with an event camera,” IEEE transactions on pattern analysis and machine intelligence, vol. 43, no. 6, pp. 1964–1980, 2019

  36. [36]

    Physics-driven diffusion models for impact sound synthesis from videos,

    K. Su, K. Qian, E. Shlizerman, A. Torralba, and C. Gan, “Physics-driven diffusion models for impact sound synthesis from videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9749–9759, 2023

  37. [37]

    Isolated single sound lip-reading using a frame-based camera and event-based camera,

    T. Kanamaru, T. Arakane, and T. Saitoh, “Isolated single sound lip-reading using a frame-based camera and event-based camera,” Frontiers in Artificial Intelligence, vol. 5, p. 1070964, 2023

  38. [38]

    Multi-grained spatio-temporal features perceived network for event-based lip-reading,

    G. Tan, Y. Wang, H. Han, Y. Cao, F. Wu, and Z.-J. Zha, “Multi-grained spatio-temporal features perceived network for event-based lip-reading,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20094–20103, 2022

  39. [39]

    Collaborative viseme subword and end-to-end modeling for word-level lip reading,

    H. Chen, Q. Wang, J. Du, G.-S. Wan, S.-F. Xiong, B.-C. Yin, J. Pan, and C.-H. Lee, “Collaborative viseme subword and end-to-end modeling for word-level lip reading,” IEEE Transactions on Multimedia, vol. 26, pp. 9358–9371, 2024

  40. [40]

    Audio-visual target speaker enhancement on multi-talker environment using event-driven cameras,

    A. Arriandiaga, G. Morrone, L. Pasa, L. Badino, and C. Bartolozzi, “Audio-visual target speaker enhancement on multi-talker environment using event-driven cameras,” in 2021 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1–5, IEEE, 2021

  41. [41]

    Convnext v2: Co-designing and scaling convnets with masked autoencoders,

    S. Woo, S. Debnath, R. Hu, X. Chen, Z. Liu, I. S. Kweon, and S. Xie, “Convnext v2: Co-designing and scaling convnets with masked autoencoders,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 16133–16142, 2023

  42. [42]

    Maxim: Multi-axis mlp for image processing,

    Z. Tu, H. Talebi, H. Zhang, F. Yang, P. Milanfar, A. Bovik, and Y. Li, “Maxim: Multi-axis mlp for image processing,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 5769–5780, 2022

  43. [43]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” arXiv preprint arXiv:2312.00752, 2023

  44. [44]

    BigVGAN: A Universal Neural Vocoder with Large-Scale Training. arXiv preprint arXiv:2206.04658, 2022

    S.-g. Lee, W. Ping, B. Ginsburg, B. Catanzaro, and S. Yoon, “Bigvgan: A universal neural vocoder with large-scale training,” arXiv preprint arXiv:2206.04658, 2022

  45. [45]

    Simple and controllable music generation,

    J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y. Adi, and A. Défossez, “Simple and controllable music generation,” Advances in neural information processing systems, vol. 36, pp. 47704–47720, 2023

  46. [46]

    Representation Learning with Contrastive Predictive Coding

    A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018

  47. [47]

    Flow Matching for Generative Modeling

    Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generative modeling,” arXiv preprint arXiv:2210.02747, 2022

  48. [48]

    Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis,

    J. Kong, J. Kim, and J. Bae, “Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis,” Advances in neural information processing systems, vol. 33, pp. 17022–17033, 2020

  49. [49]

    From faces to voices: Learning hierarchical representations for high-quality video-to-speech,

    J.-H. Kim, J. Choi, J. Kim, C. Jung, and J. S. Chung, “From faces to voices: Learning hierarchical representations for high-quality video-to-speech,” in Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 15874–15884, 2025

  50. [50]

    Seen and unseen emotional style transfer for voice conversion with a new emotional speech dataset,

    K. Zhou, B. Sisman, R. Liu, and H. Li, “Seen and unseen emotional style transfer for voice conversion with a new emotional speech dataset,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 920–924, IEEE, 2021

  51. [51]

    Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph,

    A. B. Zadeh, P. P. Liang, S. Poria, E. Cambria, and L.-P. Morency, “Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2236–2246, 2018

  52. [52]

    Mmaudio: Taming multimodal joint training for high-quality video-to-audio synthesis,

    H. K. Cheng, M. Ishii, A. Hayakawa, T. Shibuya, A. Schwing, and Y. Mitsufuji, “Mmaudio: Taming multimodal joint training for high-quality video-to-audio synthesis,” in Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 28901–28911, 2025

  53. [53]

    Contentvec: An improved self-supervised speech representation by disentangling speakers,

    K. Qian, Y. Zhang, H. Gao, J. Ni, C.-I. Lai, D. Cox, M. Hasegawa-Johnson, and S. Chang, “Contentvec: An improved self-supervised speech representation by disentangling speakers,” in International conference on machine learning, pp. 18003–18017, PMLR, 2022

  54. [54]

    Styletts 2: Towards human-level text-to-speech through style diffusion and adversarial training with large speech language models,

    Y. A. Li, C. Han, V. Raghavan, G. Mischler, and N. Mesgarani, “Styletts 2: Towards human-level text-to-speech through style diffusion and adversarial training with large speech language models,” Advances in neural information processing systems, vol. 36, pp. 19594–19621, 2023

  55. [55]

    On the audio-visual synchronization for lip-to-speech synthesis,

    Z. Niu and B. Mak, “On the audio-visual synchronization for lip-to-speech synthesis,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7843–7852, 2023

  56. [56]

    A lip sync expert is all you need for speech to lip generation in the wild,

    K. Prajwal, R. Mukhopadhyay, V. P. Namboodiri, and C. Jawahar, “A lip sync expert is all you need for speech to lip generation in the wild,” in Proceedings of the 28th ACM international conference on multimedia, pp. 484–492, 2020

  57. [57]

    Audio-visual speech representation expert for enhanced talking face video generation and evaluation,

    D. Yaman, F. I. Eyiokur, L. Bärmann, S. Akti, H. K. Ekenel, and A. Waibel, “Audio-visual speech representation expert for enhanced talking face video generation and evaluation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6003–6013, 2024

  58. [58]

    Mega-tts 2: Boosting prompting mechanisms for zero-shot speech synthesis,

    Z. Jiang, J. Liu, Y. Ren, J. He, Z. Ye, S. Ji, Q. Yang, C. Zhang, P. Wei, C. Wang, et al., “Mega-tts 2: Boosting prompting mechanisms for zero-shot speech synthesis,” in International Conference on Learning Representations, vol. 2024, pp. 57919–57939, 2024

  59. [59]

    Indicvoices-r: Unlocking a massive multilingual multi-speaker speech corpus for scaling indian tts,

    A. Sankar, S. Anand, P. S. Varadhan, S. Thomas, M. Singal, S. Kumar, D. Mehendale, A. Krishana, G. Raju, and M. Khapra, “Indicvoices-r: Unlocking a massive multilingual multi-speaker speech corpus for scaling indian tts,” Advances in Neural Information Processing Systems, vol. 37, pp. 68161–68182, 2024

  60. [60]

    Styletts: A style-based generative model for natural and diverse text-to-speech synthesis,

    Y. A. Li, C. Han, and N. Mesgarani, “Styletts: A style-based generative model for natural and diverse text-to-speech synthesis,” IEEE Journal of Selected Topics in Signal Processing, vol. 19, no. 1, pp. 283–296, 2025

  61. [61]

    Paralinguistics-aware speech-empowered large language models for natural conversation,

    H. Kim, S. Seo, K. Jeong, O. Kwon, S. Kim, J. Kim, J. Lee, E. Song, M. Oh, J.-W. Ha, et al., “Paralinguistics-aware speech-empowered large language models for natural conversation,” Advances in Neural Information Processing Systems, vol. 37, pp. 131072–131103, 2024

  62. [62]

    Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models,

    Z. Ju, Y. Wang, K. Shen, X. Tan, D. Xin, D. Yang, Y. Liu, Y. Leng, K. Song, S. Tang, et al., “Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models,” arXiv preprint arXiv:2403.03100, 2024.

    Fang Jingping received the B.S. degree in Computer Science and Technology from Zhengzhou University of Industrial Technology, Zheng...