pith. machine review for the scientific record.

arxiv: 2604.23632 · v1 · submitted 2026-04-26 · 💻 cs.CV · cs.MM · cs.SD

Recognition: unknown

Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation

Chunyu Li, Haoyuan Xia, Hao Zhu, Jiaye Li, Jingdong Wang, Ruiqiao Mei, Siyu Zhu

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 06:50 UTC · model grok-4.3

classification 💻 cs.CV · cs.MM · cs.SD
keywords real-time avatar generation · audio-video synthesis · diffusion models · preference distillation · streaming generation · joint audio-visual · text-driven avatars · few-step acceleration

The pith

Hallo-Live generates synchronized audio-video avatars in real time by combining asynchronous dual-stream diffusion with preference-guided distillation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Hallo-Live as a streaming system that produces portrait video and matching speech from text input fast enough for live use. It splits generation into separate video and audio diffusion streams that run asynchronously, using Future-Expanding Attention to give each video segment access to current and near-future audio cues for better lip sync. To recover quality after aggressive few-step acceleration, the method applies Human-Centric Preference-Guided DMD, which reweights training examples according to separate rewards for visual realism, speech naturalness, and audio-visual alignment. On two H200 GPUs the system reaches 20.38 frames per second with 0.94 seconds of latency, delivering 16 times the throughput and 99.3 times lower latency than the full teacher model while scoring comparably on VideoAlign and Sync metrics. The method also generalizes across photorealistic, multi-speaker, and stylized faces without retraining.

Core claim

Hallo-Live achieves real-time joint audio-video avatar generation through asynchronous dual-stream diffusion combined with human-centric preference distillation, delivering 20.38 FPS at 0.94 seconds latency on two H200 GPUs, which is 16 times higher throughput and 99.3 times lower latency than the teacher model while preserving comparable VideoAlign overall scores and Sync Confidence scores.
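
As a quick sanity check, the reported ratios pin down the implied teacher figures; a minimal back-of-envelope calculation in Python, using only the numbers quoted above:

    # Back-of-envelope check of the reported speedup ratios. Inputs are
    # the figures quoted above; the teacher numbers are implied by the
    # ratios, not independently reported here.
    student_fps = 20.38          # Hallo-Live throughput on 2x H200
    student_latency_s = 0.94     # Hallo-Live latency
    throughput_gain = 16.0       # reported gain vs. the teacher (Ovi)
    latency_gain = 99.3          # reported reduction vs. the teacher (Ovi)

    teacher_fps = student_fps / throughput_gain            # ~1.27 FPS
    teacher_latency_s = student_latency_s * latency_gain   # ~93.3 s

    print(f"implied teacher throughput: {teacher_fps:.2f} FPS")
    print(f"implied teacher latency:    {teacher_latency_s:.1f} s")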

What carries the argument

Asynchronous dual-stream diffusion with Future-Expanding Attention that supplies each video block with synchronous audio plus a short future phonetic horizon, plus Human-Centric Preference-Guided DMD (HP-DMD) that reweights distillation samples by rewards for visual fidelity, speech naturalness, and audio-visual synchronization.
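
Neither mechanism is specified in detail on this page, so the sketch below shows one plausible reading of Future-Expanding Attention as a block-level cross-attention mask: video block i may attend to audio blocks up to i + horizon. The block granularity, the retention of past audio, and the horizon value are all assumptions.

    import torch

    def future_expanding_mask(n_blocks: int, horizon: int) -> torch.Tensor:
        """Boolean cross-attention mask between video blocks (rows) and
        audio blocks (columns); True means attention is allowed.

        Sketch only: video block i sees past and synchronous audio
        (j <= i) plus a short future horizon (j <= i + horizon). The
        abstract names only synchronous audio plus future phonetic cues;
        keeping past audio is our assumption.
        """
        i = torch.arange(n_blocks).unsqueeze(1)  # video block index
        j = torch.arange(n_blocks).unsqueeze(0)  # audio block index
        return j <= i + horizon

    mask = future_expanding_mask(n_blocks=6, horizon=2)
    print(mask.int())
    # Row 0 can see audio blocks 0..2; a strict block-causal mask
    # (horizon=0) stops at block 0, which is the articulation lag the
    # future horizon is meant to remove.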

If this is right

  • The framework supports interactive applications such as live virtual assistants or real-time video dubbing.
  • Generation quality holds across photorealistic, multi-speaker, and stylized avatar styles without additional fine-tuning.
  • The method outperforms prior accelerated baselines on the combined quality-efficiency metric.
  • Streaming dual-stream design reduces articulation lag while keeping audio and video aligned.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same preference-reweighting idea could shorten inference in other multimodal diffusion models that currently require many steps.
  • Further hardware-specific optimizations might bring similar real-time performance to single-GPU or edge devices.
  • The asynchronous streams could be extended to include additional modalities such as text overlays or gestures with minimal extra latency.

Load-bearing premise

Reweighting training samples by rewards for visual fidelity, speech naturalness, and audio-visual synchronization is enough to prevent quality drop from few-step distillation without creating artifacts that the chosen metrics miss.
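
The reweighting formula is not reproduced on this page, so the following is a minimal sketch of one plausible HP-DMD-style objective: the three reward signals are combined into per-sample weights that scale the distillation loss. The linear combination, the softmax normalization, and the temperature are assumptions, not the paper's stated construction.

    import torch

    def hp_dmd_weights(r_visual, r_speech, r_sync,
                       coeffs=(1.0, 1.0, 1.0), temperature=1.0):
        """Combine per-sample rewards for visual fidelity, speech
        naturalness, and audio-visual sync into batch-normalized
        weights. Sketch only; the abstract says samples are reweighted
        by these rewards but does not give the form."""
        a, b, c = coeffs
        score = a * r_visual + b * r_speech + c * r_sync  # shape (B,)
        return torch.softmax(score / temperature, dim=0)  # sums to 1

    # Usage: scale per-sample distillation loss terms before reducing.
    B = 4
    per_sample_dmd_loss = torch.rand(B)  # stand-in for the DMD loss terms
    w = hp_dmd_weights(torch.rand(B), torch.rand(B), torch.rand(B))
    loss = (w * per_sample_dmd_loss).sum()  # preference-weighted objective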

What would settle it

A side-by-side user study on 100 held-out prompts showing that Hallo-Live outputs receive significantly lower preference votes than the teacher model on lip-sync naturalness or visual realism would falsify the claim that HP-DMD fully compensates for acceleration.
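
Such a study reduces to a simple hypothesis test. A sketch under the stated setup (100 prompts, forced-choice preference between teacher and Hallo-Live outputs, ties discarded), with an invented vote count purely for illustration:

    from scipy.stats import binomtest

    n_prompts = 100
    teacher_preferred = 63  # hypothetical tally, for illustration only

    # Under H0 (no quality gap) each model wins a prompt with p = 0.5.
    result = binomtest(teacher_preferred, n_prompts, p=0.5,
                       alternative="greater")
    print(f"p-value = {result.pvalue:.4f}")
    # A small p-value would mean raters reliably prefer the teacher,
    # i.e. HP-DMD did not fully compensate for the acceleration.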

Figures

Figures reproduced from arXiv: 2604.23632 by Chunyu Li, Haoyuan Xia, Hao Zhu, Jiaye Li, Jingdong Wang, Ruiqiao Mei, Siyu Zhu.

Figure 1. Our method enables real-time streaming text-driven joint audio-video avatar generation. On two NVIDIA H200 GPUs, Hallo-Live …

Figure 2. Overview of Hallo-Live. Top left: Stage I adapts a pretrained dual-stream DiT to the streaming setting using cross-modal future …

Figure 3. Comparison of attention mechanisms. (a) The Strict Block …

Figure 5. Comparison with state-of-the-art methods (Ovi [ …

Figure 6. Generation results with different prompts. The figure showcases diverse generation capabilities, ranging from specific spatial …

Figure 7. Line plot of the Sync-C score under different attention …

Figure 8. Qualitative comparison of individual reward enhancements. The reward-weighted distillation allows the student to pull its distribution …
original abstract

Real-time text-driven joint audio-video avatar generation requires jointly synthesizing portrait video and speech with high fidelity and precise synchronization, yet existing audio-visual diffusion models remain too slow for interactive use and often degrade noticeably after aggressive acceleration. We present Hallo-Live, a streaming framework for joint audio-visual avatar generation that combines asynchronous dual-stream diffusion with human-centric preference-guided distillation. To reduce articulation lag in causal generation, we introduce Future-Expanding Attention, which allows each video block to access synchronous audio together with a short horizon of future phonetic cues. To mitigate the quality loss of few-step distillation, we further propose Human-Centric Preference-Guided DMD (HP-DMD), which reweights training samples using rewards from visual fidelity, speech naturalness, and audio-visual synchronization. On two NVIDIA H200 GPUs, Hallo-Live runs at 20.38 FPS with 0.94 seconds latency, yielding 16.0x higher throughput and 99.3x lower latency than the teacher model Ovi. Despite this speedup, it retains strong generation quality, reaching comparable VideoAlign overall score and Sync Confidence score while outperforming other accelerated baselines in the overall quality-efficiency trade-off. Qualitative results further show robust generalization across photorealistic, multi-speaker, and stylized scenarios. To the best of our knowledge, Hallo-Live is the first framework to combine streaming dual-stream diffusion with preference-guided distillation for real-time, text-driven audio-visual generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents Hallo-Live, a streaming framework for real-time text-driven joint audio-video avatar generation. It combines asynchronous dual-stream diffusion with Future-Expanding Attention (to provide each video block access to synchronous audio plus a short horizon of future phonetic cues) and Human-Centric Preference-Guided DMD (HP-DMD), which reweights training samples via rewards on visual fidelity, speech naturalness, and audio-visual synchronization to offset quality degradation from few-step distillation. The central empirical claim is that the method achieves 20.38 FPS and 0.94 s latency on two NVIDIA H200 GPUs (16.0x higher throughput and 99.3x lower latency than the teacher model Ovi) while attaining comparable VideoAlign overall and Sync Confidence scores and outperforming other accelerated baselines in the quality-efficiency trade-off.

Significance. If the performance and quality-retention claims are rigorously supported, the work would be significant for enabling interactive applications in avatar synthesis and real-time multimedia. The combination of streaming dual-stream diffusion and preference-guided distillation addresses a clear practical bottleneck in diffusion-based audio-visual generation, and the reported speedups are substantial. However, the significance is tempered by the absence of ablations, error bars, or statistical validation in the reported results.

major comments (2)
  1. [Abstract and experimental results] The reported metrics (20.38 FPS, 0.94 s latency, 16.0x throughput, comparable VideoAlign/Sync scores) are presented without error bars, statistical tests, full experimental details, or ablation studies isolating the individual contributions of Future-Expanding Attention and HP-DMD. These omissions are load-bearing for the central claim that quality is retained despite aggressive acceleration, as the skeptic note highlights that HP-DMD reward reweighting may not fully mitigate artifacts or metric biases.
  2. [HP-DMD section] The description of reweighting samples using rewards from visual fidelity, speech naturalness, and audio-visual synchronization does not analyze whether these rewards comprehensively cover potential degradation modes (e.g., subtle temporal inconsistencies or unnatural prosody) or whether they introduce biases relative to the reported evaluation metrics. This directly affects the defensibility of the quality-retention claim after few-step distillation.
minor comments (1)
  1. [Abstract] The abstract refers to 'qualitative results' showing generalization across photorealistic, multi-speaker, and stylized scenarios but does not indicate the number of examples or evaluation protocol, which would improve clarity on the robustness claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback that identifies key areas to strengthen the empirical support for our claims. We address each major comment below and will revise the manuscript accordingly to incorporate additional analyses and details.

point-by-point responses
  1. Referee: [Abstract and experimental results] The reported metrics (20.38 FPS, 0.94 s latency, 16.0x throughput, comparable VideoAlign/Sync scores) are presented without error bars, statistical tests, full experimental details, or ablation studies isolating the individual contributions of Future-Expanding Attention and HP-DMD. These omissions are load-bearing for the central claim that quality is retained despite aggressive acceleration, as the skeptic note highlights that HP-DMD reward reweighting may not fully mitigate artifacts or metric biases.

    Authors: We agree that error bars, statistical tests, full experimental details, and dedicated ablations are important for rigorously supporting the quality-retention claim. In the revised manuscript, we will add error bars derived from multiple runs for the key metrics (FPS, latency, VideoAlign, and Sync Confidence); a minimal version of this procedure is sketched after these responses. We will also include ablation studies that isolate the contributions of Future-Expanding Attention and HP-DMD, along with expanded details on the experimental setup. To address concerns about artifacts and metric biases, we will add discussion and qualitative analysis showing how the human-centric rewards mitigate common degradation modes relative to the reported baselines. revision: yes

  2. Referee: [HP-DMD section] The description of reweighting samples using rewards from visual fidelity, speech naturalness, and audio-visual synchronization does not analyze whether these rewards comprehensively cover potential degradation modes (e.g., subtle temporal inconsistencies or unnatural prosody) or whether they introduce biases relative to the reported evaluation metrics. This directly affects the defensibility of the quality-retention claim after few-step distillation.

    Authors: We acknowledge that the HP-DMD description would benefit from explicit analysis of reward coverage and potential biases. In the revision, we will expand the HP-DMD section with a discussion of how the three reward components target degradation modes such as temporal inconsistencies and unnatural prosody. We will also analyze alignment with evaluation metrics by referencing our experimental results, including comparisons that show the reweighting preserves Sync Confidence and VideoAlign scores without introducing evident biases. Supplementary material will include reward distribution statistics if space is limited in the main text. revision: yes
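
As a concrete reading of the error-bar promise in response 1, the sketch below computes a mean and sample standard deviation over repeated benchmark runs. The run count and FPS values are invented for illustration; only the procedure is implied by the rebuttal.

    import numpy as np

    # Hypothetical repeated throughput measurements (FPS); the values
    # and run count are illustrative assumptions, not reported numbers.
    runs_fps = np.array([20.41, 20.38, 20.29, 20.44, 20.38])

    mean = runs_fps.mean()
    std = runs_fps.std(ddof=1)  # sample standard deviation
    print(f"FPS = {mean:.2f} +/- {std:.2f} (n={len(runs_fps)})")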

Circularity Check

0 steps flagged

No load-bearing circularity; empirical results stand on independent measurements

full rationale

The paper proposes architectural additions (asynchronous dual-stream diffusion, Future-Expanding Attention, HP-DMD reweighting) and validates them via direct runtime measurements (20.38 FPS, 0.94 s latency) and quality scores (VideoAlign, Sync Confidence) against an external teacher model Ovi and other baselines. No equations or derivations reduce the reported metrics to the reward definitions by construction; the reweighting is a training technique whose success is checked by separate evaluation. No self-citation chains or uniqueness theorems are invoked as load-bearing support. The work is therefore self-contained as an empirical systems contribution.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 2 invented entities

Inferred from abstract only. The paper introduces two new mechanisms and relies on standard assumptions about diffusion models and reward models. No explicit free parameters are named, but design choices such as the future-cue horizon and the reward weighting are implicit; a configuration stub after this ledger makes them explicit.

free parameters (2)
  • future phonetic cue horizon length
    Short horizon of future cues in Future-Expanding Attention is a tunable design parameter affecting synchronization.
  • distillation step count
    Few-step distillation requires choosing the number of steps as a hyperparameter.
axioms (2)
  • domain assumption: Diffusion models can be distilled to few steps while preserving quality when guided by appropriate rewards.
    Underpins the claim that HP-DMD mitigates quality loss.
  • domain assumption: Human preference rewards for visual fidelity, speech naturalness, and synchronization can be computed reliably and used for reweighting.
    Central to the HP-DMD method described.
invented entities (2)
  • Future-Expanding Attention (no independent evidence)
    purpose: Allows each video block to access synchronous audio plus a short horizon of future phonetic cues to reduce articulation lag in causal streaming.
    New attention variant introduced to address synchronization in real-time generation.
  • Human-Centric Preference-Guided DMD (HP-DMD) (no independent evidence)
    purpose: Reweights training samples using multi-aspect rewards to reduce quality degradation during few-step distillation.
    New distillation variant proposed to maintain quality under acceleration.
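
To make the ledger concrete, a hypothetical configuration stub naming the two implicit free parameters; the field names and default values are illustrative guesses, not values taken from the paper.

    from dataclasses import dataclass

    @dataclass
    class HalloLiveConfig:
        """The ledger's two implicit free parameters, made explicit.
        Illustrative only: the paper exposes no such object and does
        not publish these values in the material quoted here."""
        future_horizon_blocks: int = 2  # future phonetic cues per video block
        distillation_steps: int = 4     # few-step student sampling budget

    print(HalloLiveConfig())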

pith-pipeline@v0.9.0 · 5596 in / 1632 out tokens · 47836 ms · 2026-05-08T06:50:27.468279+00:00 · methodology


Reference graph

Works this paper leans on

57 extracted references · 32 canonical work pages · 9 internal anchors

  1. [1] Dan Bigioi, Shubhajit Basak, Michał Stypułkowski, Maciej Zieba, Hugh Jordan, Rachel McDonnell, and Peter Corcoran. 2024. Speech driven video editing via an audio-conditioned diffusion model. Image and Vision Computing 142 (2024), 104911.
  2. [2-3] Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. 2023. Training Diffusion Models with Reinforcement Learning. arXiv preprint arXiv:2305.13301 (2023).
  4. [4-5] Zhiyuan Chen, Jiajiong Cao, Zhiquan Chen, Yuming Li, and Chenguang Ma. 2024. Echomimic: Lifelike audio-driven portrait animations through editable landmark conditions. arXiv preprint arXiv:2407.08136 (2024).
  6. [6] Ethan Chern, Hansi Teng, Hanwen Sun, Hao Wang, Hong Pan, Hongyu Jia, Jiadi Su, Jin Li, Junjie Yu, Lijie Liu, et al. 2026. Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model. arXiv preprint arXiv:2603.21986 (2026).
  7. [7] Joon Son Chung and Andrew Zisserman. 2016. Out of time: automated lip sync in the wild. In Asian Conference on Computer Vision. Springer, 251–263.
  8. [8] Jiahao Cui, Yan Chen, Mingwang Xu, Hanlin Shang, Yuxuan Chen, Yun Zhan, Zilong Dong, Yao Yao, Jingdong Wang, and Siyu Zhu. 2025. Hallo4: High-fidelity dynamic portrait animation via direct preference optimization and temporal motion modulation. arXiv e-prints (2025), arXiv–2505.
  9. [9] Jiahao Cui, Hui Li, Yao Yao, Hao Zhu, Hanlin Shang, Kaihui Cheng, Hang Zhou, Siyu Zhu, and Jingdong Wang. 2024. Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation. arXiv preprint arXiv:2410.07718 (2024).
  10. [10] Jiahao Cui, Hui Li, Yun Zhan, Hanlin Shang, Kaihui Cheng, Yuqi Ma, Shan Mu, Hang Zhou, Jingdong Wang, and Siyu Zhu. 2025. Hallo3: Highly dynamic and realistic portrait image animation with video diffusion transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference. 21086–21095.
  11. [11] Qijun Gan, Ruizi Yang, Jianke Zhu, Shaofei Xue, and Steven Hoi. 2025. OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation. arXiv preprint arXiv:2506.18866 (2025).
  12. [12] Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, et al. 2026. LTX-2: Efficient Joint Audio-Visual Foundation Model. arXiv preprint arXiv:2601.03233 (2026).
  13. [13] Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. 2025. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009 (2025).
  14. [14] Xiaozhong Ji, Xiaobin Hu, Zhihong Xu, Junwei Zhu, Chuming Lin, Qingdong He, Jiangning Zhang, Donghao Luo, Yi Chen, Qin Lin, et al. 2025. Sonic: Shifting focus to global audio perception in portrait animation. In Proceedings of the Computer Vision and Pattern Recognition Conference. 193–203.
  15. [15] Dengyang Jiang, Dongyang Liu, Zanyi Wang, Qilong Wu, Xin Jin, David Liu, Zhen Li, Mengmeng Wang, Peng Gao, and Harry Yang. 2025. Distribution Matching Distillation Meets Reinforcement Learning. arXiv preprint arXiv:2511.13649 (2025).
  16. [16] Jianwen Jiang, Chao Liang, Jiaqi Yang, Gaojie Lin, Tianyun Zhong, and Yanbo Zheng. 2024. Loopy: Taming audio-driven portrait avatar with long-term motion dependency. In The Thirteenth International Conference on Learning Representations.
  17. [17] Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. 2023. Aligning Text-to-Image Models using Human Feedback. arXiv preprint arXiv:2302.12192 (2023).
  18. [18] Chunyu Li, Chao Zhang, Weikai Xu, Jingyu Lin, Jinghui Xie, Weiguo Feng, Bingyue Peng, Cunjian Chen, and Weiwei Xing. 2024. Latentsync: Taming audio-conditioned latent diffusion models for lip sync with syncnet supervision. arXiv preprint arXiv:2412.09262 (2024).
  19. [19] Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Menghan Xia, Xintao Wang, et al. 2025. Improving video generation with human feedback. arXiv preprint arXiv:2501.13918 (2025).
  20. [20] Kai Liu, Wei Li, Lai Chen, Shengqiong Wu, Yanhao Zheng, Jiayi Ji, Fan Zhou, Rongxin Jiang, Jiebo Luo, Hao Fei, and Tat-Seng Chua. 2025. JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization. arXiv preprint arXiv:2503.23377 (2025).
  21. [21] Chetwin Low, Weimin Wang, and Calder Katyal. 2025. Ovi: Twin backbone cross-modal fusion for audio-video generation. arXiv preprint arXiv:2510.01284 (2025).
  22. [22] Yunhong Lu, Yanhong Zeng, Haobo Li, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jiapeng Zhu, Hengyuan Cao, Zhipeng Zhang, Xing Zhu, et al. 2025. Reward forcing: Efficient streaming video generation with rewarded distribution matching distillation. arXiv preprint arXiv:2512.04678 (2025).
  23. [23] Yihong Luo, Tianyang Hu, Jiacheng Sun, Yujun Cai, and Jing Tang. 2025. Learning Few-Step Diffusion Models by Trajectory Distribution Matching. arXiv preprint arXiv:2503.06674 (2025).
  24. [24] Yifeng Ma, Shiwei Zhang, Jiayu Wang, Xiang Wang, Yingya Zhang, and Zhidong Deng. 2023. Dreamtalk: When expressive talking head generation meets diffusion probabilistic models. arXiv preprint arXiv:2312.09767 (2023).
  25. [25] Soumik Mukhopadhyay, Saksham Suri, Ravi Teja Gadde, and Abhinav Shrivastava. 2024. Diff2lip: Audio conditioned diffusion models for lip-synchronization. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 5292–5302.
  26. [26] William Peebles and Saining Xie. 2023. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4195–4205.
  27. [27] Ziqiao Peng, Jiwen Liu, Haoxian Zhang, Xiaoqiang Liu, Songlin Tang, Pengfei Wan, Di Zhang, Hongyan Liu, and Jun He. 2025. Omnisync: Towards universal lip synchronization via diffusion transformers. arXiv preprint arXiv:2505.21448 (2025).
  28. [28-29] K R Prajwal, Rudrabha Mukhopadhyay, Vinay Namboodiri, and C V Jawahar. 2020. A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild. In Proceedings of the 28th ACM International Conference on Multimedia. 484–492.
  30. [30] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research 21, 140 (2020), 1–67. http://jmlr.org/papers/v21/20-074.html
  31. [31] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10684–10695.
  32. [32] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 234–241.
  33. [33] Ludan Ruan, Yiyang Ma, Huan Yang, Huiguo He, Bei Liu, Jianlong Fu, Nicholas Jing Yuan, Qin Jin, and Baining Guo. 2023. Mm-diffusion: Learning multi-modal diffusion models for joint audio and video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10219–10228.
  34. [34] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538 (2017).
  35. [35] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. 2024. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing 568 (2024), 127063.
  36. [36] Yaofeng Su, Yuming Li, Zeyue Xue, Jie Huang, Siming Fu, Haoran Li, Ying Li, Zezhong Qian, Haoyang Huang, and Nan Duan. 2026. OmniForcing: Unleashing Real-time Joint Audio-Visual Generation. arXiv preprint arXiv:2603.11647 (2026).
  37. [37] OpenMOSS Team, Donghua Yu, Mingshu Chen, Qi Chen, Qi Luo, Qianyi Wu, Qinyuan Cheng, Ruixiao Li, Tianyi Liang, Wenbo Zhang, et al. 2026. Mova: Towards scalable and synchronized video-audio generation. arXiv preprint arXiv:2602.08794 (2026).
  38. [38] Qwen Team. 2026. Qwen3.5-Omni Technical Report. arXiv preprint arXiv:2604.15804 (2026).
  39. [39] Linrui Tian, Qi Wang, Bang Zhang, and Liefeng Bo. 2024. EMO: Emote Portrait Alive – Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions. arXiv preprint arXiv:2402.17485 (2024).
  40. [40] Andros Tjandra, Yi-Chiao Wu, Baishan Guo, John Hoffman, Brian Ellis, Apoorv Vyas, Bowen Shi, Sanyuan Chen, Matt Le, Nick Zacharov, et al. 2025. Meta audiobox aesthetics: Unified automatic quality assessment for speech, music, and sound. arXiv preprint arXiv:2502.05139 (2025).
  41. [41] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
  42. [42] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. 2025. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025).
  43. [43] Duomin Wang, Wei Zuo, Aojie Li, Ling-Hao Chen, Xinyao Liao, Deyu Zhou, Zixin Yin, Xili Dai, Daxin Jiang, and Gang Yu. 2025. UniVerse-1: Unified Audio-Video Generation via Stitching of Experts. arXiv preprint arXiv:2509.06155 (2025).
  44. [44] Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. 2024. Emu3: Next-Token Prediction is All You Need. arXiv preprint arXiv:2409.18869 (2024).
  45. [45] Mingwang Xu, Hui Li, Qingkun Su, Hanlin Shang, Liwei Zhang, Ce Liu, Jingdong Wang, Yao Yao, and Siyu Zhu. 2024. Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation. arXiv preprint arXiv:2406.08801 (2024).
  46. [46] Sicheng Xu, Guojun Chen, Yu-Xiao Guo, Jiaolong Yang, Chong Li, Zhenyu Zang, Yizhong Zhang, Xin Tong, and Baining Guo. 2024. VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time. arXiv preprint arXiv:2404.10667 (2024).
  47. [47] Yilun Xu, Weili Nie, and Arash Vahdat. 2025. One-step Diffusion Models with 𝑓-Divergence Distribution Matching. arXiv preprint arXiv:2502.15681 (2025).
  48. [48] Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T Freeman. 2024. Improved distribution matching distillation for fast image synthesis. Advances in Neural Information Processing Systems 37 (2024), 47455–47487.
  49. [49] Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. 2024. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6613–6623.
  50. [50] Wenxuan Zhang, Xiaodong Cun, Xuan Wang, Yong Zhang, Xi Shen, Yu Guo, Ying Shan, and Fei Wang. 2023. SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8652–8661.
  51. [51] Yiming Zhang, Yicheng Gu, Yanhong Zeng, Zhening Xing, Yuancheng Wang, Zhizheng Wu, Bin Liu, and Kai Chen. 2026. Foleycrafter: Bring silent videos to life with lifelike and synchronized sounds. International Journal of Computer Vision 134, 1 (2026), 46.
  52. [52] Yue Zhang, Minhao Liu, Zhaokang Chen, Bin Wu, Yubin Zeng, Chao Zhan, Yingjie He, Junxin Huang, and Wenjiang Zhou. 2024. MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting. arXiv preprint arXiv:2410.10122 (2024).
  53. [53] Dingcheng Zhen, Shunshun Yin, Shiyang Qin, Hou Yi, Ziwei Zhang, Siyuan Liu, Gan Qi, and Ming Tao. 2025. Teller: Real-Time Streaming Audio-Driven Portrait Animation with Autoregressive Motion Generation. arXiv preprint arXiv:2503.18429 (2025).
  54. [54] Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Lulu Gu, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, et al. 2025. Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755 (2025).
  55. [55] Yongming Zhu, Longhao Zhang, Zhengkun Rong, Tianshu Hu, Shuang Liang, and Zhipeng Ge. 2025. INFP: Audio-driven interactive head generation in dyadic conversations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10667–10677.
