Recognition: unknown
Generate Your Talking Avatar from Video Reference
Pith reviewed 2026-05-07 05:27 UTC · model grok-4.3
The pith
TAVR generates talking avatars from video references across different scenes using token selection and staged training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By replacing the conventional single-image, same-scene reference with cross-scene video inputs, TAVR demonstrates that a token selection module together with same-scene pretraining, cross-scene fine-tuning, and identity reinforcement learning can bridge large domain gaps and synthesize high-fidelity talking avatars that preserve temporal coherence and lip synchronization.
What carries the argument
A token selection module that extracts relevant tokens from extended cross-scene video sequences, combined with a three-stage training scheme: same-scene pretraining for appearance copying, cross-scene fine-tuning for domain adaptation, and identity-based reinforcement learning for final alignment.
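The review gives no architectural detail for this module, so the following is only a minimal sketch of one plausible design, assuming a cross-attention relevance score with hard top-k selection over reference-video tokens; the class name `TokenSelector`, the `keep_ratio` parameter, and the mean-over-queries scoring rule are illustrative assumptions, not details from TAVR.

```python
import torch
import torch.nn as nn


class TokenSelector(nn.Module):
    """Hypothetical sketch: keep the reference-video tokens most relevant to the
    clip being generated, so a long cross-scene reference fits the context budget."""

    def __init__(self, dim: int, keep_ratio: float = 0.25):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)
        self.key_proj = nn.Linear(dim, dim)
        self.keep_ratio = keep_ratio

    def forward(self, gen_tokens: torch.Tensor, ref_tokens: torch.Tensor) -> torch.Tensor:
        # gen_tokens: (B, N_gen, D) tokens of the target clip being synthesized.
        # ref_tokens: (B, N_ref, D) tokens extracted from the cross-scene reference video.
        q = self.query_proj(gen_tokens)
        k = self.key_proj(ref_tokens)
        scores = torch.einsum("bnd,bmd->bnm", q, k) / (q.shape[-1] ** 0.5)
        relevance = scores.mean(dim=1)                          # (B, N_ref): mean relevance per reference token
        k_keep = max(1, int(self.keep_ratio * ref_tokens.shape[1]))
        idx = relevance.topk(k_keep, dim=-1).indices            # (B, k_keep)
        idx = idx.unsqueeze(-1).expand(-1, -1, ref_tokens.shape[-1])
        return torch.gather(ref_tokens, dim=1, index=idx)       # (B, k_keep, D) compact conditioning set
```

The selected subset would then serve as conditioning tokens for the video backbone; hard top-k is only one choice, and a soft weighting or learned pooling would also match the description.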
If this is right
- Generation becomes possible with any suitable video reference at inference time rather than a matched static image.
- Quantitative and qualitative performance exceeds prior same-scene image-conditioned methods on cross-scene tasks.
- High-fidelity avatars can be placed in fully customized backgrounds while retaining identity and expression fidelity.
- The introduced 158-pair benchmark provides a standardized test for evaluating cross-scene robustness in talking-avatar systems.
Where Pith is reading between the lines
- Everyday user videos recorded in varied lighting and settings could serve directly as references without requiring studio-matched footage.
- The staged adaptation pattern may transfer to other video-generation problems that face large domain shifts between conditioning and target content.
- The approach opens the possibility of on-demand avatar creation for applications such as personalized video messages or virtual meetings.
Load-bearing premise
The token selection module and three-stage training sequence can reliably bridge large visual and temporal differences between reference and target scenes without introducing artifacts or breaking lip synchronization and motion naturalness.
What would settle it
If quantitative metrics on the 158-pair cross-scene benchmark show no improvement over image-based baselines or if generated videos display visible lip-sync errors and motion artifacts when the reference video comes from a different scene, the central claim would be falsified.
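Two of the quantities that test names can be measured in a straightforward way; the sketch below assumes a caller-supplied face-embedding function (`embed_face`, e.g., an ArcFace-style encoder) and uses a crude frame-difference proxy for temporal coherence. Both function names and the proxy are illustrative, not the paper's evaluation protocol.

```python
import numpy as np


def identity_similarity(gen_frames, ref_frames, embed_face):
    """Mean cosine similarity between face embeddings of generated and reference frames."""
    gen = np.stack([embed_face(f) for f in gen_frames])   # (N, D)
    ref = np.stack([embed_face(f) for f in ref_frames])   # (M, D)
    gen /= np.linalg.norm(gen, axis=1, keepdims=True)
    ref /= np.linalg.norm(ref, axis=1, keepdims=True)
    return float((gen @ ref.T).mean())


def temporal_jitter(gen_frames):
    """Crude temporal-coherence proxy: mean absolute pixel change between consecutive frames."""
    diffs = [np.abs(a.astype(np.float32) - b.astype(np.float32)).mean()
             for a, b in zip(gen_frames[:-1], gen_frames[1:])]
    return float(np.mean(diffs))
```

Lip synchronization would additionally need an audio-visual sync model (a SyncNet-style offset or confidence score), which is not sketched here.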
Original abstract
Existing talking avatar methods typically adopt an image-to-video pipeline conditioned on a static reference image within the same scene as the target generation. This restricted, single-view perspective lacks sufficient temporal and expression cues, limiting the ability to synthesize high-fidelity talking avatars in customized backgrounds. To this end, we introduce Talking Avatar generation from Video Reference (TAVR), a novel framework that shifts the paradigm by leveraging cross-scene video inputs. To effectively process these extended temporal contexts and bridge cross-scene domain gaps, TAVR integrates a token selection module alongside a comprehensive three-stage training scheme. Specifically, same-scene video pretraining establishes foundational appearance copying, which is subsequently expanded by cross-scene reference fine-tuning for robust cross-scene adaptation. Finally, task-specific reinforcement learning aligns the generated outputs with identity-based rewards to maximize identity similarity. To systematically evaluate cross-scene robustness, we construct a new benchmark comprising 158 carefully curated cross-scene video pairs. Extensive experiments show that TAVR benefits from flexible inference-time video referencing and consistently surpasses existing baselines both quantitatively and qualitatively. This work has been deployed to production. For more related research, please visit HeyGen Research (https://www.heygen.com/research) and HeyGen Avatar-V (https://www.heygen.com/research/avatar-v-model).
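Taken at face value, the abstract describes a three-stage curriculum. The sketch below lays that schedule out in outline; every name in it (`diffusion_loss`, `generate`, `take`, the stage lengths, and the reward-weighted update used for the final stage) is a placeholder standing in for details the abstract does not give.

```python
def train_tavr(model, same_scene_data, cross_scene_data, identity_reward, optimizer,
               steps=(10_000, 10_000, 2_000)):
    # Stage 1: same-scene video pretraining -- establish appearance copying.
    for batch in same_scene_data.take(steps[0]):
        loss = model.diffusion_loss(batch.reference_video, batch.target_video, batch.audio)
        loss.backward(); optimizer.step(); optimizer.zero_grad()

    # Stage 2: cross-scene reference fine-tuning -- adapt to large domain gaps
    # between the reference scene and the target scene.
    for batch in cross_scene_data.take(steps[1]):
        loss = model.diffusion_loss(batch.reference_video, batch.target_video, batch.audio)
        loss.backward(); optimizer.step(); optimizer.zero_grad()

    # Stage 3: identity-based reinforcement learning -- shown here as a simple
    # reward-weighted objective; the abstract only says outputs are aligned with
    # identity-based rewards, not which RL algorithm is used.
    for batch in cross_scene_data.take(steps[2]):
        sample = model.generate(batch.reference_video, batch.audio)
        reward = identity_reward(sample, batch.reference_video)   # scalar, higher = same identity
        loss = reward * model.diffusion_loss(batch.reference_video, sample, batch.audio)
        loss.backward(); optimizer.step(); optimizer.zero_grad()
```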
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Talking Avatar generation from Video Reference (TAVR), a framework for synthesizing talking avatars conditioned on cross-scene video references rather than single same-scene images. It proposes a token selection module to process extended temporal contexts and a three-stage training pipeline (same-scene pretraining for appearance copying, cross-scene fine-tuning for domain adaptation, and identity reinforcement learning). A new benchmark of 158 curated cross-scene video pairs is introduced for evaluation, with claims of quantitative and qualitative superiority over baselines and production deployment at HeyGen.
Significance. If the empirical claims are substantiated with concrete metrics and ablations, TAVR could meaningfully advance talking-head video synthesis by relaxing the single-image, same-scene constraint and enabling more flexible, temporally rich references. The staged training plus RL alignment addresses a recognized practical challenge in cross-domain avatar generation, and the new benchmark fills a gap in cross-scene evaluation. Production deployment is a positive indicator of real-world utility, though the absence of reported numbers currently limits assessment of effect size and robustness.
major comments (4)
- Abstract: the claim that TAVR 'consistently surpasses existing baselines both quantitatively and qualitatively' on the 158-pair benchmark supplies no numerical results (e.g., FID, lip-sync error, temporal consistency scores), no error bars, and no ablation tables isolating the token selection module or each training stage. This information is load-bearing for the central empirical claim.
- Method section (token selection module): the architecture, selection criterion, and integration of the token selection module are not described with equations or pseudocode. Without these details it is impossible to determine how the module bridges large cross-scene domain gaps while preserving lip synchronization and motion coherence.
- Training and Experiments sections: no ablation results are presented that quantify the incremental contribution of same-scene pretraining, cross-scene fine-tuning, and identity RL, nor are the RL reward weights or hyperparameters reported. This leaves the weakest assumption—that the three-stage scheme reliably closes domain gaps without introducing artifacts—unverified. A sketch of what such a weighted reward could look like follows this list.
- Benchmark construction: the 158-pair dataset is described only as 'carefully curated'; no selection protocol, difficulty stratification, or comparison against existing talking-head datasets is provided. Post-hoc curation raises the possibility that reported gains reflect benchmark composition rather than algorithmic robustness.
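On the reward weights flagged in the third comment: such weights usually enter as coefficients on separate reward terms. The sketch below shows one weighted scalar reward, assuming hypothetical component functions (`identity_score`, `lip_sync_score`, `smoothness_score`) and made-up default coefficients; none of these names or values come from the paper.

```python
from dataclasses import dataclass


@dataclass
class RewardWeights:
    identity: float = 1.0     # assumed to dominate, given the identity-based RL stage
    lip_sync: float = 0.5     # hypothetical
    smoothness: float = 0.1   # hypothetical


def total_reward(video, audio, reference, weights,
                 identity_score, lip_sync_score, smoothness_score):
    """Weighted sum of per-video reward terms; each *_score is a caller-supplied
    function expected to return a value in [0, 1]."""
    return (weights.identity * identity_score(video, reference)
            + weights.lip_sync * lip_sync_score(video, audio)
            + weights.smoothness * smoothness_score(video))
```

Reporting the actual coefficients and how they were tuned would resolve this comment directly.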
minor comments (2)
- Abstract: the production-deployment statement would be strengthened by a brief mention of the specific metrics used in internal validation.
- References: several recent works on video-conditioned avatar generation and RL for identity preservation are not cited; adding them would better situate the contribution.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We appreciate the positive assessment of the potential impact of TAVR, the staged training approach, and the new cross-scene benchmark. We address each major comment below and commit to revising the manuscript to incorporate the requested clarifications, metrics, and details.
Point-by-point responses
- Referee: Abstract: the claim that TAVR 'consistently surpasses existing baselines both quantitatively and qualitatively' on the 158-pair benchmark supplies no numerical results (e.g., FID, lip-sync error, temporal consistency scores), no error bars, and no ablation tables isolating the token selection module or each training stage. This information is load-bearing for the central empirical claim.
Authors: We agree that the abstract should include concrete numerical support for the central claims. In the revised version we will add the key quantitative results (FID, lip-sync error, temporal consistency) with error bars and will explicitly reference the ablation findings on the token selection module and training stages. revision: yes
- Referee: Method section (token selection module): the architecture, selection criterion, and integration of the token selection module are not described with equations or pseudocode. Without these details it is impossible to determine how the module bridges large cross-scene domain gaps while preserving lip synchronization and motion coherence.
Authors: We acknowledge that the token selection module requires a more formal specification. The revised manuscript will include the missing equations for the selection criterion and integration, together with pseudocode that shows how temporal tokens are chosen and injected while maintaining lip-sync and motion coherence. revision: yes
- Referee: Training and Experiments sections: no ablation results are presented that quantify the incremental contribution of same-scene pretraining, cross-scene fine-tuning, and identity RL, nor are the RL reward weights or hyperparameters reported. This leaves the weakest assumption—that the three-stage scheme reliably closes domain gaps without introducing artifacts—unverified.
Authors: We will add a dedicated ablation subsection that isolates the contribution of each training stage (same-scene pretraining, cross-scene fine-tuning, identity RL). We will also report the exact RL reward weights, learning-rate schedules, and other hyperparameters used in the identity reinforcement learning stage. revision: yes
- Referee: Benchmark construction: the 158-pair dataset is described only as 'carefully curated'; no selection protocol, difficulty stratification, or comparison against existing talking-head datasets is provided. Post-hoc curation raises the possibility that reported gains reflect benchmark composition rather than algorithmic robustness.
Authors: We will expand the benchmark section with a detailed description of the curation protocol, the criteria used to select the 158 cross-scene pairs, the difficulty stratification applied, and direct comparisons against existing talking-head datasets to demonstrate that the reported gains are not an artifact of benchmark composition. revision: yes
Circularity Check
No circularity: empirical pipeline with independent benchmark evaluation
full rationale
The paper presents TAVR as an engineering framework consisting of a token selection module and a three-stage training process (same-scene pretraining, cross-scene fine-tuning, identity RL). No equations, derivations, or quantitative predictions are described that reduce to fitted parameters or self-referential definitions by construction. The central claims rest on experimental results against a newly constructed 158-pair cross-scene benchmark, which is an external evaluation set rather than a quantity defined by the method itself. No self-citation chains or uniqueness theorems are invoked as load-bearing support. This is a standard empirical ML paper whose improvements are not forced by internal definitions.
Axiom & Free-Parameter Ledger
free parameters (1)
- RL reward weights and training hyperparameters
axioms (1)
- domain assumption: Neural networks can learn appearance copying from same-scene videos and then adapt to cross-scene inputs via fine-tuning.
invented entities (2)
- token selection module: no independent evidence
- three-stage training scheme: no independent evidence