pith. machine review for the scientific record.

arxiv: 2604.15923 · v1 · submitted 2026-04-17 · 💻 cs.SD · cs.CV

Recognition: unknown

Hierarchical Codec Diffusion for Video-to-Speech Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 08:08 UTC · model grok-4.3

classification 💻 cs.SD cs.CV
keywords video-to-speech · diffusion transformer · residual vector quantization · hierarchical modeling · audio-visual alignment · speech synthesis · prosody generation

The pith

A diffusion transformer generates speech from silent video by aligning visual cues to different levels of a speech codec hierarchy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Video-to-speech methods have treated speech as a single block, missing that it naturally breaks into coarse speaker identity and finer prosodic timing. This paper shows that the built-in levels of residual vector quantization tokens let visual information from a video match the right speech layer directly. Low-level tokens receive conditioning from lip motion and face identity to set the content, while high-level tokens receive expression cues to shape delivery. A dual-scale normalization layer blends overall voice character with moment-to-moment changes. If the separation holds, discrete token models can produce speech that follows the video more closely in both accuracy and liveliness than earlier flat approaches.

Core claim

HiCoDiT generates discrete speech tokens level by level from video input. Lower-level blocks condition on lip-synchronized motion and facial identity to capture speaker-aware semantics, while higher-level blocks use facial expression to modulate prosodic dynamics. A dual-scale adaptive instance layer normalization captures global vocal style through channel-wise normalization and local prosody dynamics through temporal-wise normalization. This hierarchical conditioning produces stronger audio-visual alignment than prior methods.
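
To make the dual-scale adaptive instance layer normalization concrete, here is a minimal sketch, assuming token features of shape (batch, time, channels), one global style vector per clip, and frame-level prosody features; the projections and dimensions are illustrative assumptions, not the paper's released layer.

```python
import torch.nn as nn

class DualScaleAdaINLN(nn.Module):
    """Illustrative dual-scale adaptive instance/layer normalization.

    Channel-wise normalization (per frame) is modulated by a global style
    vector; temporal-wise normalization (per channel) is modulated by
    frame-level prosody cues. All shapes and layer sizes are assumptions.
    """

    def __init__(self, dim: int, style_dim: int, prosody_dim: int):
        super().__init__()
        self.channel_norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.global_affine = nn.Linear(style_dim, 2 * dim)   # per-channel scale/shift
        self.local_affine = nn.Linear(prosody_dim, 2)        # per-frame scale/shift

    def forward(self, h, style, prosody):
        # h: (B, T, C) token features; style: (B, style_dim); prosody: (B, T, prosody_dim)
        # Global vocal style: normalize each frame across channels, then modulate.
        g_scale, g_shift = self.global_affine(style).chunk(2, dim=-1)
        h = self.channel_norm(h) * (1 + g_scale.unsqueeze(1)) + g_shift.unsqueeze(1)

        # Local prosody dynamics: normalize each channel across time, then
        # modulate every frame with its own scale/shift.
        mean, std = h.mean(dim=1, keepdim=True), h.std(dim=1, keepdim=True) + 1e-5
        l_scale, l_shift = self.local_affine(prosody).chunk(2, dim=-1)
        return (h - mean) / std * (1 + l_scale) + l_shift
```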

What carries the argument

The Hierarchical Codec Diffusion Transformer (HiCoDiT) with level-specific blocks and dual-scale adaptive instance layer normalization, which matches coarse visual cues to lower RVQ tokens and fine visual cues to higher RVQ tokens.
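
As a rough illustration of that routing, the sketch below splits an RVQ token grid into low-level and high-level groups and runs each through its own conditioned block; the 2/10 level split, vocabulary size, and transformer layers are assumptions for illustration, not the paper's configuration.

```python
import torch.nn as nn

class LevelRoutedBlocks(nn.Module):
    """Toy level-routed conditioning: coarse visual cues feed the low-level
    token path, expression cues feed the high-level path."""

    def __init__(self, vocab=1024, dim=256, low_levels=2):
        super().__init__()
        self.low_levels = low_levels
        self.embed = nn.Embedding(vocab, dim)
        self.low_block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.high_block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)

    def forward(self, tokens, lip_id_feats, expr_feats):
        # tokens: (B, T, L) discrete RVQ codes; visual features: (B, T, dim).
        low, high = tokens[..., :self.low_levels], tokens[..., self.low_levels:]
        low_h = self.embed(low).mean(dim=2) + lip_id_feats   # speaker-aware content path
        high_h = self.embed(high).mean(dim=2) + expr_feats   # prosody path
        return self.low_block(low_h), self.high_block(high_h)
```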

If this is right

  • Stronger direct matching of visual motion to speech content and prosody at the exact token granularity where each property lives.
  • Generation of speech without any audio signal at inference time while still preserving speaker identity and rhythmic detail.
  • Demonstration that discrete codec tokens support more precise property alignment than continuous waveform or spectrogram methods.
  • Improved handling of facial expression effects on speech timing and intonation through dedicated high-level blocks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same level-wise conditioning pattern could be tested on video-to-music or video-to-sound-effect tasks where audio also has coarse identity and fine timing layers.
  • If the RVQ separation is stable, future work could try learning which visual features route to which level rather than fixing the assignment by hand.
  • Performance on videos with rapid expression changes or multiple speakers would test how well the assumed semantic split survives outside controlled talking-head data.

Load-bearing premise

That lower levels of the RVQ codec naturally encode coarse speaker semantics while higher levels encode fine prosody, so visual signals can be applied separately at each level without audio input.
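
One way to test this premise, in the spirit of the paper's hierarchy analysis, is to decode speech from only the first k RVQ levels and track where timbre and prosody scores improve. The sketch below assumes caller-supplied codec and metric callables; it is not tied to any specific codec library's API.

```python
def probe_rvq_hierarchy(codec, speaker_sim, f0_corr, wavs, n_levels=12):
    """Decode cumulatively more RVQ levels and record metric gains per level.

    codec.encode / codec.decode, speaker_sim, and f0_corr are hypothetical
    callables supplied by the caller (e.g., a speaker-embedding cosine
    similarity and an F0 correlation measure).
    """
    results = []
    for wav in wavs:
        codes = codec.encode(wav)                  # assumed shape (T, n_levels)
        for k in range(1, n_levels + 1):
            recon = codec.decode(codes[:, :k])     # cumulative decode of levels 1..k
            results.append({
                "levels": k,
                "timbre": speaker_sim(wav, recon),
                "prosody": f0_corr(wav, recon),
            })
    return results
```

If the premise holds, timbre scores should saturate after the first few levels while prosody scores keep climbing as higher levels are added.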

What would settle it

An ablation that removes level-specific conditioning and applies the same visual features to every token level, then measures whether objective fidelity scores and subjective expressiveness ratings drop compared with the full model.
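
A minimal harness for that ablation might look like the sketch below, where a hierarchical variant and a flat variant (the same visual features applied at every level) are scored on identical evaluation pairs; the model and metric interfaces are hypothetical placeholders.

```python
def compare_conditioning_variants(model_hier, model_flat, eval_pairs, metrics):
    """Score level-specific vs. flat conditioning with the same metrics.

    eval_pairs: iterable of (video, reference_speech); metrics: dict mapping
    metric names to callables metric(generated, reference) -> float.
    Both model objects are assumed to expose a generate(video) method.
    """
    scores = {"hierarchical": {}, "flat": {}}
    for name, model in (("hierarchical", model_hier), ("flat", model_flat)):
        for metric_name, metric_fn in metrics.items():
            vals = [metric_fn(model.generate(video), ref) for video, ref in eval_pairs]
            scores[name][metric_name] = sum(vals) / len(vals)
    return scores
```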

Figures

Figures reproduced from arXiv: 2604.15923 by Boyuan Cao, Chenhui Wang, Gaoxiang Cong, Hongming Shan, Jiaxin Ye, Xin-Cheng Wen, Zhaoyang Li.

Figure 1
Figure 1. Overall framework of HiCoDiT. We formulate video-to-speech generation as a hierarchical masked token prediction task. Speech is tokenized using an RVQ codec and split into low-level components x_t^{r1:r2} and high-level components x_t^{r3:r12}, reflecting the intrinsic hierarchy of speech tokens. Guided by this structure, we disentangle visual features from the input video V into lip motion c_lip, identity c_id… view at source ↗
Figure 2
Figure 2. Hierarchy analysis of speech tokens. (a) RVQ codec encodes and decodes speech through multiple VQ layers. (b) The x-axis denotes cumulative decoding across token levels, and the y-axis reports scores for semantic fidelity, timbre similarity, and prosody quality. It can be observed that speaker-aware semantics improvements concentrate in lower layers, while prosody gains emerge in higher layers. These condit… view at source ↗
Figure 3
Figure 3. The visualization of the mel-spectrograms of ground truth (GT) and synthesized speech obtained by different models. As highlighted in the red boxes, the spectrograms generated by our method exhibit higher clarity with improved signal-to-noise ratio. view at source ↗
Figure 4
Figure 4. Comparison of generated Mels on real-world film data. view at source ↗
read the original abstract

Video-to-Speech (VTS) generation aims to synthesize speech from a silent video without auditory signals. However, existing VTS methods disregard the hierarchical nature of speech, which spans coarse speaker-aware semantics to fine-grained prosodic details. This oversight hinders direct alignment between visual and speech features at specific hierarchical levels during property matching. In this paper, leveraging the hierarchical structure of Residual Vector Quantization (RVQ)-based codec, we propose HiCoDiT, a novel Hierarchical Codec Diffusion Transformer that exploits the inherent hierarchy of discrete speech tokens to achieve strong audio-visual alignment. Specifically, since lower-level tokens encode coarse speaker-aware semantics and higher-level tokens capture fine-grained prosody, HiCoDiT employs low-level and high-level blocks to generate tokens at different levels. The low-level blocks condition on lip-synchronized motion and facial identity to capture speaker-aware content, while the high-level blocks use facial expression to modulate prosodic dynamics. Finally, to enable more effective coarse-to-fine conditioning, we propose a dual-scale adaptive instance layer normalization that jointly captures global vocal style through channel-wise normalization and local prosody dynamics through temporal-wise normalization. Extensive experiments demonstrate that HiCoDiT outperforms baselines in fidelity and expressiveness, highlighting the potential of discrete modelling for VTS. The code and speech demo are both available at https://github.com/Jiaxin-Ye/HiCoDiT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes HiCoDiT, a Hierarchical Codec Diffusion Transformer for video-to-speech (VTS) synthesis. It exploits the hierarchical structure of RVQ-based discrete speech codecs by routing low-level tokens (assumed to encode coarse speaker-aware semantics) through blocks conditioned on lip motion and facial identity, while high-level tokens (assumed to encode fine prosody) are conditioned on facial expressions. A dual-scale adaptive instance normalization (AdaIN) is introduced to jointly model global vocal style and local temporal dynamics. The work claims that this level-specific conditioning yields superior fidelity and expressiveness over flat baselines, with code and demos released.

Significance. If the central hierarchy assumption holds and the reported gains are reproducible, the approach would strengthen discrete-token methods for VTS by enabling more direct visual-to-semantic alignment at multiple scales. The public release of code and speech samples is a clear positive that supports reproducibility.

major comments (2)
  1. [Abstract and §3 (Method)] The premise that 'lower-level tokens encode coarse speaker-aware semantics and higher-level tokens capture fine-grained prosody' is asserted without codec-specific validation (e.g., no token visualizations, mutual-information analysis, or ablation on RVQ levels when audio is absent). This separation directly justifies the dual conditioning paths and dual-scale AdaIN; if it does not hold for the chosen codec, the claimed audio-visual alignment advantage over flat baselines is unsupported. (One possible form of such a mutual-information probe is sketched after these comments.)
  2. [§4 (Experiments)] The abstract states that 'extensive experiments demonstrate outperformance,' yet no quantitative tables, specific baselines, metrics (e.g., WER, MOS, F0 correlation), or statistical significance tests are referenced in the provided text. Without these, the strength of the fidelity/expressiveness claims cannot be assessed.
minor comments (2)
  1. [Figure 1 and §3.2] The architecture diagram and block descriptions would benefit from explicit labeling of which RVQ levels are routed to low-level vs. high-level blocks and how the dual-scale AdaIN is applied temporally vs. channel-wise.
  2. [Notation] Define the exact RVQ codebook sizes and the number of levels used in the hierarchical diffusion process to avoid ambiguity when reproducing the conditioning scheme.
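
For major comment 1, one possible form of the requested mutual-information check is sketched below: quantize frame-aligned visual features and estimate MI against each RVQ level's token stream. The frame alignment, clustering choice, and input arrays are assumptions made for illustration, not a protocol taken from the paper.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import mutual_info_score

def level_wise_mi(rvq_tokens, visual_feats, n_clusters=64):
    """Estimate MI between each RVQ level and a quantized visual feature stream.

    rvq_tokens: integer array of shape (T, L), frame-aligned with
    visual_feats of shape (T, D). Continuous visual features are discretized
    with k-means so that a discrete MI estimator can be applied.
    """
    visual_labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(visual_feats)
    return [
        mutual_info_score(rvq_tokens[:, level], visual_labels)
        for level in range(rvq_tokens.shape[1])
    ]
```

Under the paper's premise, lip- and identity-derived features should show higher MI with the lowest levels, and expression features with the higher levels.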

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback on our manuscript. We address each major comment below and commit to revisions that strengthen the validation of our core assumptions and the clarity of our experimental claims.

read point-by-point responses
  1. Referee: [Abstract and §3 (Method)] The premise that 'lower-level tokens encode coarse speaker-aware semantics and higher-level tokens capture fine-grained prosody' is asserted without codec-specific validation (e.g., no token visualizations, mutual-information analysis, or ablation on RVQ levels when audio is absent). This separation directly justifies the dual conditioning paths and dual-scale AdaIN; if it does not hold for the chosen codec, the claimed audio-visual alignment advantage over flat baselines is unsupported.

    Authors: We appreciate this observation. While the hierarchical properties of RVQ (lower levels capturing broader semantics and higher levels refining prosodic details) are established in the speech codec literature (e.g., EnCodec and SoundStream papers), we acknowledge the value of codec-specific validation for our setup. In the revised manuscript we will add (i) visualizations of RVQ token distributions, (ii) mutual-information analysis between individual RVQ levels and visual features, and (iii) an ablation that conditions only on visual inputs (audio absent) to empirically support the level-specific separation. These additions will directly bolster the justification for the dual conditioning paths and dual-scale AdaIN. revision: yes

  2. Referee: [§4 (Experiments)] The abstract states that 'extensive experiments demonstrate outperformance,' yet no quantitative tables, specific baselines, metrics (e.g., WER, MOS, F0 correlation), or statistical significance tests are referenced in the provided text. Without these, the strength of the fidelity/expressiveness claims cannot be assessed.

    Authors: We apologize if the experimental details were not sufficiently cross-referenced in the text the referee received. Section 4 of the full manuscript already contains quantitative tables reporting WER, MOS, F0 correlation, additional fidelity metrics, comparisons against multiple baselines, and statistical significance tests. In the revision we will explicitly cite these results from §4 in both the abstract and §3, and ensure all tables are prominently referenced so that the outperformance claims are immediately verifiable. revision: yes

Circularity Check

0 steps flagged

No circularity; architecture rests on explicit upfront assumption about RVQ hierarchy rather than self-referential derivation

full rationale

The paper states its core premise directly in the abstract and method section: lower RVQ levels encode coarse speaker semantics while higher levels encode prosody. It then builds level-specific conditioning blocks and dual-scale AdaIN on top of that premise. This is an assumption, not a derivation that reduces to its own outputs or fitted parameters. No equations are shown to be equivalent by construction, no self-citations form a load-bearing chain for the hierarchy claim, and no 'prediction' is statistically forced from a subset of the same data. The empirical results (outperformance on fidelity/expressiveness) are presented as validation of the design choice, not as tautological confirmation. The chain of reasoning is therefore checked against external benchmarks rather than against its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim rests on the pre-trained RVQ codec providing usable hierarchical semantics and on standard diffusion training assumptions; no new free parameters are explicitly introduced or fitted in the abstract description.

axioms (1)
  • domain assumption Lower-level RVQ tokens encode coarse speaker-aware semantics and higher-level tokens capture fine-grained prosody
    This premise directly motivates the low-level and high-level blocks and is stated as the basis for the architecture.

pith-pipeline@v0.9.0 · 5565 in / 1145 out tokens · 43363 ms · 2026-05-10T08:08:39.503029+00:00 · methodology

discussion (0)

