Recognition: unknown
Generate Your Talking Avatar from Video Reference
Pith reviewed 2026-05-07 05:27 UTC · model grok-4.3
The pith
TAVR generates talking avatars from video references across different scenes using token selection and staged training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By replacing the conventional single-image, same-scene reference with cross-scene video inputs, TAVR demonstrates that a token selection module together with same-scene pretraining, cross-scene fine-tuning, and identity reinforcement learning can bridge large domain gaps and synthesize high-fidelity talking avatars that preserve temporal coherence and lip synchronization.
What carries the argument
A token selection module that extracts relevant tokens from extended cross-scene video sequences, combined with a three-stage training scheme: same-scene pretraining for appearance copying, cross-scene fine-tuning for domain adaptation, and identity-based reinforcement learning for final alignment.
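The review gives no architectural detail for this module, so the following is only a minimal sketch of one plausible design, assuming a cross-attention relevance score with hard top-k selection over reference-video tokens; the class name `TokenSelector`, the `keep_ratio` parameter, and the mean-over-queries scoring rule are illustrative assumptions, not details from TAVR.

```python
import torch
import torch.nn as nn


class TokenSelector(nn.Module):
    """Hypothetical sketch: keep the reference-video tokens most relevant to the
    clip being generated, so a long cross-scene reference fits the context budget."""

    def __init__(self, dim: int, keep_ratio: float = 0.25):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)
        self.key_proj = nn.Linear(dim, dim)
        self.keep_ratio = keep_ratio

    def forward(self, gen_tokens: torch.Tensor, ref_tokens: torch.Tensor) -> torch.Tensor:
        # gen_tokens: (B, N_gen, D) tokens of the target clip being synthesized.
        # ref_tokens: (B, N_ref, D) tokens extracted from the cross-scene reference video.
        q = self.query_proj(gen_tokens)
        k = self.key_proj(ref_tokens)
        scores = torch.einsum("bnd,bmd->bnm", q, k) / (q.shape[-1] ** 0.5)
        relevance = scores.mean(dim=1)                          # (B, N_ref): mean relevance per reference token
        k_keep = max(1, int(self.keep_ratio * ref_tokens.shape[1]))
        idx = relevance.topk(k_keep, dim=-1).indices            # (B, k_keep)
        idx = idx.unsqueeze(-1).expand(-1, -1, ref_tokens.shape[-1])
        return torch.gather(ref_tokens, dim=1, index=idx)       # (B, k_keep, D) compact conditioning set
```

The selected subset would then serve as conditioning tokens for the video backbone; hard top-k is only one choice, and a soft weighting or learned pooling would also match the description.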
If this is right
- Generation becomes possible with any suitable video reference at inference time rather than a matched static image.
- Quantitative and qualitative performance exceeds prior same-scene image-conditioned methods on cross-scene tasks.
- High-fidelity avatars can be placed in fully customized backgrounds while retaining identity and expression fidelity.
- The introduced 158-pair benchmark provides a standardized test for evaluating cross-scene robustness in talking-avatar systems.
Where Pith is reading between the lines
- Everyday user videos recorded in varied lighting and settings could serve directly as references without requiring studio-matched footage.
- The staged adaptation pattern may transfer to other video-generation problems that face large domain shifts between conditioning and target content.
- The approach opens the possibility of on-demand avatar creation for applications such as personalized video messages or virtual meetings.
Load-bearing premise
The token selection module and three-stage training sequence can reliably bridge large visual and temporal differences between reference and target scenes without introducing artifacts or breaking lip synchronization and motion naturalness.
What would settle it
If quantitative metrics on the 158-pair cross-scene benchmark show no improvement over image-based baselines or if generated videos display visible lip-sync errors and motion artifacts when the reference video comes from a different scene, the central claim would be falsified.
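Two of the quantities that test names can be measured in a straightforward way; the sketch below assumes a caller-supplied face-embedding function (`embed_face`, e.g., an ArcFace-style encoder) and uses a crude frame-difference proxy for temporal coherence. Both function names and the proxy are illustrative, not the paper's evaluation protocol.

```python
import numpy as np


def identity_similarity(gen_frames, ref_frames, embed_face):
    """Mean cosine similarity between face embeddings of generated and reference frames."""
    gen = np.stack([embed_face(f) for f in gen_frames])   # (N, D)
    ref = np.stack([embed_face(f) for f in ref_frames])   # (M, D)
    gen /= np.linalg.norm(gen, axis=1, keepdims=True)
    ref /= np.linalg.norm(ref, axis=1, keepdims=True)
    return float((gen @ ref.T).mean())


def temporal_jitter(gen_frames):
    """Crude temporal-coherence proxy: mean absolute pixel change between consecutive frames."""
    diffs = [np.abs(a.astype(np.float32) - b.astype(np.float32)).mean()
             for a, b in zip(gen_frames[:-1], gen_frames[1:])]
    return float(np.mean(diffs))
```

Lip synchronization would additionally need an audio-visual sync model (a SyncNet-style offset or confidence score), which is not sketched here.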
Original abstract
Existing talking avatar methods typically adopt an image-to-video pipeline conditioned on a static reference image within the same scene as the target generation. This restricted, single-view perspective lacks sufficient temporal and expression cues, limiting the ability to synthesize high-fidelity talking avatars in customized backgrounds. To this end, we introduce Talking Avatar generation from Video Reference (TAVR), a novel framework that shifts the paradigm by leveraging cross-scene video inputs. To effectively process these extended temporal contexts and bridge cross-scene domain gaps, TAVR integrates a token selection module alongside a comprehensive three-stage training scheme. Specifically, same-scene video pretraining establishes foundational appearance copying, which is subsequently expanded by cross-scene reference fine-tuning for robust cross-scene adaptation. Finally, task-specific reinforcement learning aligns the generated outputs with identity-based rewards to maximize identity similarity. To systematically evaluate cross-scene robustness, we construct a new benchmark comprising 158 carefully curated cross-scene video pairs. Extensive experiments show that TAVR benefits from flexible inference-time video referencing and consistently surpasses existing baselines both quantitatively and qualitatively. This work has been deployed to production. For more related research, please visit HeyGen Research (https://www.heygen.com/research) and HeyGen Avatar-V (https://www.heygen.com/research/avatar-v-model).
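Taken at face value, the abstract describes a three-stage curriculum. The sketch below lays that schedule out in outline; every name in it (`diffusion_loss`, `generate`, `take`, the stage lengths, and the reward-weighted update used for the final stage) is a placeholder standing in for details the abstract does not give.

```python
def train_tavr(model, same_scene_data, cross_scene_data, identity_reward, optimizer,
               steps=(10_000, 10_000, 2_000)):
    # Stage 1: same-scene video pretraining -- establish appearance copying.
    for batch in same_scene_data.take(steps[0]):
        loss = model.diffusion_loss(batch.reference_video, batch.target_video, batch.audio)
        loss.backward(); optimizer.step(); optimizer.zero_grad()

    # Stage 2: cross-scene reference fine-tuning -- adapt to large domain gaps
    # between the reference scene and the target scene.
    for batch in cross_scene_data.take(steps[1]):
        loss = model.diffusion_loss(batch.reference_video, batch.target_video, batch.audio)
        loss.backward(); optimizer.step(); optimizer.zero_grad()

    # Stage 3: identity-based reinforcement learning -- shown here as a simple
    # reward-weighted objective; the abstract only says outputs are aligned with
    # identity-based rewards, not which RL algorithm is used.
    for batch in cross_scene_data.take(steps[2]):
        sample = model.generate(batch.reference_video, batch.audio)
        reward = identity_reward(sample, batch.reference_video)   # scalar, higher = same identity
        loss = reward * model.diffusion_loss(batch.reference_video, sample, batch.audio)
        loss.backward(); optimizer.step(); optimizer.zero_grad()
```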
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Talking Avatar generation from Video Reference (TAVR), a framework for synthesizing talking avatars conditioned on cross-scene video references rather than single same-scene images. It proposes a token selection module to process extended temporal contexts and a three-stage training pipeline (same-scene pretraining for appearance copying, cross-scene fine-tuning for domain adaptation, and identity reinforcement learning). A new benchmark of 158 curated cross-scene video pairs is introduced for evaluation, with claims of quantitative and qualitative superiority over baselines and production deployment at HeyGen.
Significance. If the empirical claims are substantiated with concrete metrics and ablations, TAVR could meaningfully advance talking-head video synthesis by relaxing the single-image, same-scene constraint and enabling more flexible, temporally rich references. The staged training plus RL alignment addresses a recognized practical challenge in cross-domain avatar generation, and the new benchmark fills a gap in cross-scene evaluation. Production deployment is a positive indicator of real-world utility, though the absence of reported numbers currently limits assessment of effect size and robustness.
major comments (4)
- Abstract: the claim that TAVR 'consistently surpasses existing baselines both quantitatively and qualitatively' on the 158-pair benchmark supplies no numerical results (e.g., FID, lip-sync error, temporal consistency scores), no error bars, and no ablation tables isolating the token selection module or each training stage. This information is load-bearing for the central empirical claim.
- Method section (token selection module): the architecture, selection criterion, and integration of the token selection module are not described with equations or pseudocode. Without these details it is impossible to determine how the module bridges large cross-scene domain gaps while preserving lip synchronization and motion coherence.
- Training and Experiments sections: no ablation results are presented that quantify the incremental contribution of same-scene pretraining, cross-scene fine-tuning, and identity RL, nor are the RL reward weights or hyperparameters reported. This leaves the weakest assumption—that the three-stage scheme reliably closes domain gaps without introducing artifacts—unverified. A sketch of what such a weighted reward could look like follows this list.
- Benchmark construction: the 158-pair dataset is described only as 'carefully curated'; no selection protocol, difficulty stratification, or comparison against existing talking-head datasets is provided. Post-hoc curation raises the possibility that reported gains reflect benchmark composition rather than algorithmic robustness.
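On the reward weights flagged in the third comment: such weights usually enter as coefficients on separate reward terms. The sketch below shows one weighted scalar reward, assuming hypothetical component functions (`identity_score`, `lip_sync_score`, `smoothness_score`) and made-up default coefficients; none of these names or values come from the paper.

```python
from dataclasses import dataclass


@dataclass
class RewardWeights:
    identity: float = 1.0     # assumed to dominate, given the identity-based RL stage
    lip_sync: float = 0.5     # hypothetical
    smoothness: float = 0.1   # hypothetical


def total_reward(video, audio, reference, weights,
                 identity_score, lip_sync_score, smoothness_score):
    """Weighted sum of per-video reward terms; each *_score is a caller-supplied
    function expected to return a value in [0, 1]."""
    return (weights.identity * identity_score(video, reference)
            + weights.lip_sync * lip_sync_score(video, audio)
            + weights.smoothness * smoothness_score(video))
```

Reporting the actual coefficients and how they were tuned would resolve this comment directly.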
minor comments (2)
- Abstract: the production-deployment statement would be strengthened by a brief mention of the specific metrics used in internal validation.
- References: several recent works on video-conditioned avatar generation and RL for identity preservation are not cited; adding them would better situate the contribution.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We appreciate the positive assessment of the potential impact of TAVR, the staged training approach, and the new cross-scene benchmark. We address each major comment below and commit to revising the manuscript to incorporate the requested clarifications, metrics, and details.
Point-by-point responses
- Referee: Abstract: the claim that TAVR 'consistently surpasses existing baselines both quantitatively and qualitatively' on the 158-pair benchmark supplies no numerical results (e.g., FID, lip-sync error, temporal consistency scores), no error bars, and no ablation tables isolating the token selection module or each training stage. This information is load-bearing for the central empirical claim.
Authors: We agree that the abstract should include concrete numerical support for the central claims. In the revised version we will add the key quantitative results (FID, lip-sync error, temporal consistency) with error bars and will explicitly reference the ablation findings on the token selection module and training stages. revision: yes
- Referee: Method section (token selection module): the architecture, selection criterion, and integration of the token selection module are not described with equations or pseudocode. Without these details it is impossible to determine how the module bridges large cross-scene domain gaps while preserving lip synchronization and motion coherence.
Authors: We acknowledge that the token selection module requires a more formal specification. The revised manuscript will include the missing equations for the selection criterion and integration, together with pseudocode that shows how temporal tokens are chosen and injected while maintaining lip-sync and motion coherence. revision: yes
- Referee: Training and Experiments sections: no ablation results are presented that quantify the incremental contribution of same-scene pretraining, cross-scene fine-tuning, and identity RL, nor are the RL reward weights or hyperparameters reported. This leaves the weakest assumption—that the three-stage scheme reliably closes domain gaps without introducing artifacts—unverified.
Authors: We will add a dedicated ablation subsection that isolates the contribution of each training stage (same-scene pretraining, cross-scene fine-tuning, identity RL). We will also report the exact RL reward weights, learning-rate schedules, and other hyperparameters used in the identity reinforcement learning stage. revision: yes
- Referee: Benchmark construction: the 158-pair dataset is described only as 'carefully curated'; no selection protocol, difficulty stratification, or comparison against existing talking-head datasets is provided. Post-hoc curation raises the possibility that reported gains reflect benchmark composition rather than algorithmic robustness.
Authors: We will expand the benchmark section with a detailed description of the curation protocol, the criteria used to select the 158 cross-scene pairs, the difficulty stratification applied, and direct comparisons against existing talking-head datasets to demonstrate that the reported gains are not an artifact of benchmark composition. revision: yes
Circularity Check
No circularity: empirical pipeline with independent benchmark evaluation
full rationale
The paper presents TAVR as an engineering framework consisting of a token selection module and a three-stage training process (same-scene pretraining, cross-scene fine-tuning, identity RL). No equations, derivations, or quantitative predictions are described that reduce to fitted parameters or self-referential definitions by construction. The central claims rest on experimental results against a newly constructed 158-pair cross-scene benchmark, which is an external evaluation set rather than a quantity defined by the method itself. No self-citation chains or uniqueness theorems are invoked as load-bearing support. This is a standard empirical ML paper whose improvements are not forced by internal definitions.
Axiom & Free-Parameter Ledger
free parameters (1)
- RL reward weights and training hyperparameters
axioms (1)
- domain assumption: Neural networks can learn appearance copying from same-scene videos and then adapt to cross-scene inputs via fine-tuning.
invented entities (2)
- token selection module: no independent evidence
- three-stage training scheme: no independent evidence