Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video
Pith reviewed 2026-05-10 17:33 UTC · model grok-4.3
The pith
A method that learns emotion as the difference between audio and visual embeddings generates talking face videos with more accurate emotions and a wider range of expressions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
C-MET learns emotion semantic vectors as the difference between emotional embeddings from two modalities, using a large-scale pretrained audio encoder and a disentangled facial expression encoder. These vectors are taken to represent pure emotional content and are used to generate expressive talking face videos from speech, including for unseen extended emotions.
What carries the argument
Emotion semantic vectors, computed as the difference between embeddings from a pretrained audio encoder and a disentangled facial expression encoder, which isolate emotional content for transfer.
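Read literally, the carrier mechanism is vector arithmetic in a shared embedding space. A minimal sketch of that reading, with encoder outputs faked as random vectors (the names, dimensions, and shared-space assumption are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy embedding dimension

# Hypothetical stand-ins for the two encoders: in C-MET these would be a
# large-scale pretrained audio encoder and a disentangled facial expression
# encoder, both mapping into a shared D-dimensional space.
audio_emb_a = rng.normal(size=D)       # speaker A, emotional speech
visual_neutral_a = rng.normal(size=D)  # speaker A, neutral expression

# Emotion semantic vector: the cross-modal embedding difference that is
# assumed to carry only emotional content.
emotion_vec = audio_emb_a - visual_neutral_a

# Transfer: shift another identity's neutral expression code by the same
# vector to obtain an emotional expression code for the generator.
visual_neutral_b = rng.normal(size=D)
expr_code_b = visual_neutral_b + emotion_vec
```

Whether `expr_code_b` lands on a genuinely emotional expression code depends entirely on how well the two encoders align and disentangle their spaces, which is the load-bearing premise of the method.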
If this is right
- The generated videos achieve 14% higher emotion accuracy than state-of-the-art methods on the MEAD and CREMA-D datasets.
- Extended emotions not seen during training can still be rendered in the output videos.
- The method works with speech input alone without requiring reference face images for the target emotion.
Where Pith is reading between the lines
- If the vectors successfully disentangle emotion, the method could generalize to editing emotions in real videos or other media like animations.
- Pretrained encoders allow the approach to benefit from advances in audio and vision models without full retraining.
- Testing on diverse speech styles could reveal if linguistic content is fully removed from the vectors.
Load-bearing premise
The differences between audio and visual embeddings capture only emotion and exclude any remaining linguistic or identity information.
What would settle it
Observing whether generated face videos correctly express the target emotion when the driving speech contains mismatched linguistic content or when applying emotions absent from the training set.
Figures
Original abstract
Talking face generation has gained significant attention as a core application of generative models. To enhance the expressiveness and realism of synthesized videos, emotion editing in talking face video plays a crucial role. However, existing approaches often limit expressive flexibility and struggle to generate extended emotions. Label-based methods represent emotions with discrete categories, which fail to capture a wide range of emotions. Audio-based methods can leverage emotionally rich speech signals, and even benefit from expressive text-to-speech (TTS) synthesis, but they fail to express the target emotions because emotions and linguistic contents are entangled in emotional speech. Image-based methods, on the other hand, rely on target reference images to guide emotion transfer, yet they require high-quality frontal views and face challenges in acquiring reference data for extended emotions (e.g., sarcasm). To address these limitations, we propose Cross-Modal Emotion Transfer (C-MET), a novel approach that generates facial expressions from speech by modeling emotion semantic vectors between speech and visual feature spaces. C-MET leverages a large-scale pretrained audio encoder and a disentangled facial expression encoder to learn emotion semantic vectors that represent the difference between two different emotional embeddings across modalities. Extensive experiments on the MEAD and CREMA-D datasets demonstrate that our method improves emotion accuracy by 14% over state-of-the-art methods, while generating expressive talking face videos, even for unseen extended emotions. Code, checkpoint, and demo are available at https://chanhyeok-choi.github.io/C-MET/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Cross-Modal Emotion Transfer (C-MET) for emotion editing in talking face videos. It computes emotion semantic vectors as the difference between embeddings from a large-scale pretrained audio encoder and a disentangled facial expression encoder to transfer emotional content across modalities. The approach is claimed to overcome limitations of label-based, audio-based, and image-based methods, with experiments on MEAD and CREMA-D datasets reporting a 14% improvement in emotion accuracy over state-of-the-art methods and successful generalization to unseen extended emotions.
Significance. If the central assumption holds—that simple subtraction of embeddings isolates transferable pure emotion without residual linguistic or identity entanglement—the method could enable more flexible and expressive talking-face generation, particularly for emotions lacking discrete labels or suitable reference images. Public release of code, checkpoints, and demos is a clear strength that supports reproducibility and extension.
major comments (2)
- [Abstract (method description)] The central claim of a 14% emotion accuracy gain and generalization to unseen emotions rests on emotion semantic vectors faithfully capturing pure emotional content. The abstract provides no description of auxiliary losses, adversarial objectives, or post-hoc verification (e.g., probing for residual linguistic or speaker-identity factors in the difference vectors) that would enforce or measure this isolation.
- [Abstract (experiments)] No information is given on the specific baselines, exact emotion accuracy metric, statistical significance testing, or ablation studies that would substantiate the reported 14% improvement on MEAD and CREMA-D. Without these, the performance gain cannot be assessed as load-bearing evidence for the cross-modal transfer approach.
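The verification the first comment asks for can be phrased as a probing experiment: if a simple classifier can read the linguistic content (or speaker identity) back out of the difference vectors, the isolation claim fails. A toy sketch on synthetic vectors, using a nearest-centroid probe (the data, dimensions, and probe are ours, purely to illustrate the protocol, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 16  # synthetic vectors, toy dimensionality

# Synthetic labels: which of two "sentences" was spoken, and which of two
# emotions was expressed.
content = rng.integers(0, 2, size=n)
emotion = rng.integers(0, 2, size=n)

e_dir = np.zeros(d); e_dir[0] = 1.0  # direction encoding emotion
c_dir = np.zeros(d); c_dir[1] = 1.0  # direction encoding linguistic content

noise = 0.05 * rng.normal(size=(n, d))
leaky = emotion[:, None] * e_dir + content[:, None] * c_dir + noise  # entangled
clean = emotion[:, None] * e_dir + noise                             # emotion only

def probe_accuracy(x, y):
    """Nearest-centroid probe with a 50/50 split: can label y be read from x?"""
    half = len(x) // 2
    xtr, ytr, xte, yte = x[:half], y[:half], x[half:], y[half:]
    centroids = np.stack([xtr[ytr == k].mean(axis=0) for k in (0, 1)])
    pred = np.argmin(((xte[:, None, :] - centroids) ** 2).sum(-1), axis=1)
    return (pred == yte).mean()

acc_leaky = probe_accuracy(leaky, content)  # content leaks: well above chance
acc_clean = probe_accuracy(clean, content)  # content absent: near chance
```

On the synthetic "leaky" vectors the probe recovers the content label almost perfectly, while on the "clean" vectors it stays near chance; running the same kind of probe on real C-MET emotion vectors would directly test this concern.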
minor comments (2)
- [Abstract] The abstract states that the method 'generates expressive talking face videos - even for unseen extended emotions' but does not clarify how extended emotions are defined or sampled in the evaluation protocol.
- [Abstract] Notation for the emotion semantic vector (difference between audio and visual embeddings) is introduced without an accompanying equation or diagram in the provided abstract, which reduces clarity for readers.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity on the method's isolation mechanism and experimental details.
Point-by-point responses
-
Referee: [Abstract (method description)] The central claim of a 14% emotion accuracy gain and generalization to unseen emotions rests on emotion semantic vectors faithfully capturing pure emotional content. The abstract provides no description of auxiliary losses, adversarial objectives, or post-hoc verification (e.g., probing for residual linguistic or speaker-identity factors in the difference vectors) that would enforce or measure this isolation.
Authors: We agree the abstract is concise and could better signal the isolation approach. The full manuscript (Section 3) describes the disentangled facial expression encoder trained with reconstruction and contrastive auxiliary losses to separate emotion from linguistic content and identity, enabling the cross-modal subtraction. Generalization results on unseen emotions and ablation studies removing these losses provide supporting evidence for effective isolation. We will revise the abstract to briefly note the role of the disentangled encoder and auxiliary objectives in achieving pure emotion transfer. revision: yes
-
Referee: [Abstract (experiments)] No information is given on the specific baselines, exact emotion accuracy metric, statistical significance testing, or ablation studies that would substantiate the reported 14% improvement on MEAD and CREMA-D. Without these, the performance gain cannot be assessed as load-bearing evidence for the cross-modal transfer approach.
Authors: The abstract summarizes the headline result, while Section 4 provides the full experimental details: comparisons against multiple SOTA baselines (both audio-driven and image-driven methods), emotion accuracy computed via a pretrained classifier on generated frames, mean accuracies with standard deviations, paired t-tests for significance, and dedicated ablation tables. We will revise the abstract to specify the evaluation metric and note that comprehensive baselines, significance testing, and ablations appear in the experiments section. revision: yes
Circularity Check
No circularity: empirical validation on external datasets
Full rationale
The paper defines emotion semantic vectors explicitly as the difference between embeddings from a large-scale pretrained audio encoder and a disentangled facial expression encoder, then uses this construction to drive cross-modal transfer. However, the central performance claim (14% emotion accuracy gain and generalization to unseen emotions) is not derived by construction from this definition; it is instead reported from separate experiments on the public MEAD and CREMA-D datasets. No equations, self-citations, or fitted parameters are shown to reduce the accuracy improvement to a tautology or to prior author work. The method therefore remains self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption A large-scale pretrained audio encoder extracts emotion information that is sufficiently disentangled from linguistic content.
- domain assumption A disentangled facial expression encoder can produce embeddings whose differences correspond to pure emotion changes.
invented entities (1)
- emotion semantic vector: no independent evidence
Reference graph
Works this paper leans on
-
[1]
Video rewrite: Driving visual speech with audio
Christoph Bregler, Michele Covell, and Malcolm Slaney. Video rewrite: Driving visual speech with audio. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pages 715–722. 2023.
2023
-
[2]
Iemocap: Interactive emotional dyadic motion capture database
Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower Provost, Samuel Kim, Jeannette Chang, Sungbok Lee, and Shrikanth Narayanan. Iemocap: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42:335–359, 2008.
2008
-
[3]
Crema-d: Crowd-sourced emotional multimodal actors dataset
Houwei Cao, David G Cooper, Michael K Keutmann, Ruben C Gur, Ani Nenkova, and Ragini Verma. Crema-d: Crowd-sourced emotional multimodal actors dataset. IEEE Transactions on Affective Computing, 5(4):377–390, 2014.
2014
-
[4]
EmoKnob: Enhance voice cloning with fine-grained emotion control
Haozhe Chen, Run Chen, and Julia Hirschberg. EmoKnob: Enhance voice cloning with fine-grained emotion control. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8170–8180, Miami, Florida, USA, 2024. Association for Computational Linguistics.
2024
-
[5]
Lip movements generation at a glance
Lele Chen, Zhiheng Li, Ross K Maddox, Zhiyao Duan, and Chenliang Xu. Lip movements generation at a glance. In Proceedings of the European Conference on Computer Vision (ECCV), pages 520–535, 2018.
2018
-
[6]
Hierarchical cross-modal talking face generation with dynamic pixel-wise loss
Lele Chen, Ross K Maddox, Zhiyao Duan, and Chenliang Xu. Hierarchical cross-modal talking face generation with dynamic pixel-wise loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7832–7841, 2019.
2019
-
[7]
Talking-head generation with rhythmic head motion
Lele Chen, Guofeng Cui, Celong Liu, Zhong Li, Ziyi Kou, Yi Xu, and Chenliang Xu. Talking-head generation with rhythmic head motion. In European Conference on Computer Vision, pages 35–51. Springer, 2020.
2020
-
[8]
Vast: vivify your talking avatar via zero-shot expressive facial style transfer
Liyang Chen, Zhiyong Wu, Runnan Li, Weihong Bao, Jun Ling, Xu Tan, and Sheng Zhao. Vast: vivify your talking avatar via zero-shot expressive facial style transfer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2977–2987, 2023.
2023
-
[9]
Executing your commands via motion diffusion in latent space
Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, and Gang Yu. Executing your commands via motion diffusion in latent space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18000–18010, 2023.
2023
-
[10]
Echomimic: Lifelike audio-driven portrait animations through editable landmark conditions
Zhiyuan Chen, Jiajiong Cao, Zhiquan Chen, Yuming Li, and Chenguang Ma. Echomimic: Lifelike audio-driven portrait animations through editable landmark conditions. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2403–2410, 2025.
2025
-
[11]
Out of time: automated lip sync in the wild
Joon Son Chung and Andrew Zisserman. Out of time: automated lip sync in the wild. In Asian Conference on Computer Vision, pages 251–263. Springer, 2016.
2016
-
[12]
Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
2025
-
[13]
Emohuman: Fine-grained emotion-controlled talking head generation via audio-text multimodal detangling
Qifeng Dai, Huidong Feng, Wendi Cui, Xinqi Cai, Yinglin Zheng, and Ming Zeng. Emohuman: Fine-grained emotion-controlled talking head generation via audio-text multimodal detangling. In Proceedings of the 2025 International Conference on Multimedia Retrieval, pages 145–154, 2025.
2025
-
[14]
Motionlcm: Real-time controllable motion generation via latent consistency model
Wenxun Dai, Ling-Hao Chen, Jingbo Wang, Jinpeng Liu, Bo Dai, and Yansong Tang. Motionlcm: Real-time controllable motion generation via latent consistency model. In European Conference on Computer Vision, pages 390–408. Springer,
-
[15]
Speech-driven facial animation using cas- caded gans for learning of motion and texture
Dipanjan Das, Sandika Biswas, Sanjana Sinha, and Brojesh- war Bhowmick. Speech-driven facial animation using cas- caded gans for learning of motion and texture. InComputer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16, pages 408–424. Springer, 2020. 1, 3
2020
-
[16]
An argument for basic emotions
Paul Ekman. An argument for basic emotions. Cognition & Emotion, 6(3-4):169–200, 1992.
1992
-
[17]
Space: Speech-driven portrait animation with controllable expression
Gururani et al. Space: Speech-driven portrait animation with controllable expression. In ICCV, 2023.
2023
-
[18]
Qwen2.5-Omni technical report
Xu et al. Qwen2.5-Omni technical report. arXiv, 2025.
2025
-
[19]
Efficient emotional adaptation for audio-driven talking-head generation
Yuan Gan, Zongxin Yang, Xihang Yue, Lingyun Sun, and Yi Yang. Efficient emotional adaptation for audio-driven talking-head generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22634–22645, 2023.
2023
-
[20]
Emotionally en- hanced talking face generation
Sahil Goyal, Sarthak Bhagat, Shagun Uppal, Hitkul Jangra, Yi Yu, Yifang Yin, and Rajiv Ratn Shah. Emotionally en- hanced talking face generation. InProceedings of the 1st International Workshop on Multimedia Content Generation and Evaluation: New Methods and Practice, pages 81–90,
-
[21]
Affective conversational agents: understanding expectations and personal influences
Javier Hernandez, Jina Suh, Judith Amores, Kael Rowan, Gonzalo Ramos, and Mary Czerwinski. Affective conversational agents: understanding expectations and personal influences. arXiv preprint arXiv:2310.12459, 2023.
2023
-
[22]
Gans trained by a two time-scale update rule converge to a local Nash equilibrium
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.
2017
-
[23]
Gpt-4o system card
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.
2024
-
[24]
Eamm: One-shot emotional talking face via audio-based emotion-aware motion model
Xinya Ji, Hang Zhou, Kaisiyuan Wang, Qianyi Wu, Wayne Wu, Feng Xu, and Xun Cao. Eamm: One-shot emotional talking face via audio-based emotion-aware motion model. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–10, 2022.
2022
-
[25]
Float: Generative motion latent flow matching for audio-driven talking portrait
Taekyung Ki, Dongchan Min, and Gyeongsu Chae. Float: Generative motion latent flow matching for audio-driven talking portrait. arXiv preprint arXiv:2412.01064, 2024.
2024
-
[26]
Expressive talking head generation with granular audio-visual control
Borong Liang, Yan Pan, Zhizhi Guo, Hang Zhou, Zhibin Hong, Xiaoguang Han, Junyu Han, Jingtuo Liu, Errui Ding, and Jingdong Wang. Expressive talking head generation with granular audio-visual control. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3387–3396, 2022.
2022
-
[27]
Mvportrait: Text-guided motion and emotion control for multi-view vivid portrait animation
Yukang Lin, Hokit Fung, Jianjin Xu, Zeping Ren, Adela SM Lau, Guosheng Yin, and Xiu Li. Mvportrait: Text-guided motion and emotion control for multi-view vivid portrait animation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26242–26252, 2025.
2025
-
[28]
Moee: Mixture of emotion experts for audio-driven portrait animation
Huaize Liu, Wenzhang Sun, Donglin Di, Shibo Sun, Jiahui Yang, Changqing Zou, and Hujun Bao. Moee: Mixture of emotion experts for audio-driven portrait animation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26222–26231, 2025.
2025
-
[29]
Semantic-aware implicit neural audio-driven video portrait generation
Xian Liu, Yinghao Xu, Qianyi Wu, Hang Zhou, Wayne Wu, and Bolei Zhou. Semantic-aware implicit neural audio-driven video portrait generation. In European Conference on Computer Vision, pages 106–125. Springer, 2022.
2022
-
[30]
Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings
Reza Lotfian and Carlos Busso. Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings. IEEE Transactions on Affective Computing, PP:1–1, 2017.
2017
-
[31]
Styletalk: One-shot talking head generation with controllable speaking styles
Yifeng Ma, Suzhen Wang, Zhipeng Hu, Changjie Fan, Tangjie Lv, Yu Ding, Zhidong Deng, and Xin Yu. Styletalk: One-shot talking head generation with controllable speaking styles. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1896–1904, 2023.
2023
-
[32]
emotion2vec: Self-supervised pre-training for speech emotion representation
Ziyang Ma, Zhisheng Zheng, Jiaxin Ye, Jinchao Li, Zhifu Gao, Shiliang Zhang, and Xie Chen. emotion2vec: Self-supervised pre-training for speech emotion representation. Proc. ACL 2024 Findings, 2024.
2024
-
[33]
Ma-avt: Modality alignment for parameter-efficient audio-visual transformers
Tanvir Mahmud, Shentong Mo, Yapeng Tian, and Diana Marculescu. Ma-avt: Modality alignment for parameter-efficient audio-visual transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7996–8005, 2024.
2024
-
[34]
Frame attention networks for facial expression recognition in videos
Debin Meng, Xiaojiang Peng, Kai Wang, and Yu Qiao. Frame attention networks for facial expression recognition in videos. In 2019 IEEE International Conference on Image Processing (ICIP), pages 3866–3870. IEEE, 2019.
2019
-
[35]
Dpe: Dis- entanglement of pose and expression for general video por- trait editing
Youxin Pang, Yong Zhang, Weize Quan, Yanbo Fan, Xi- aodong Cun, Ying Shan, and Dong-ming Yan. Dpe: Dis- entanglement of pose and expression for general video por- trait editing. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 427–436,
-
[36]
Ai-generated characters for supporting personalized learning and well-being
Pat Pataranutaporn, Valdemar Danry, Joanne Leong, Parinya Punpongsanon, Dan Novy, Pattie Maes, and Misha Sra. Ai-generated characters for supporting personalized learning and well-being. Nature Machine Intelligence, 3(12):1013–1022, 2021.
2021
-
[37]
Emotalk: Speech-driven emotional disentanglement for 3d face animation
Ziqiao Peng, Haoyu Wu, Zhenbo Song, Hao Xu, Xiangyu Zhu, Jun He, Hongyan Liu, and Zhaoxin Fan. Emotalk: Speech-driven emotional disentanglement for 3d face animation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20687–20697, 2023.
2023
-
[38]
MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations
Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. MELD: A multimodal multi-party dataset for emotion recognition in conversations. arXiv preprint arXiv:1810.02508, 2018.
2018
-
[39]
A lip sync expert is all you need for speech to lip generation in the wild
KR Prajwal, Rudrabha Mukhopadhyay, Vinay P Namboodiri, and CV Jawahar. A lip sync expert is all you need for speech to lip generation in the wild. In Proceedings of the 28th ACM International Conference on Multimedia, pages 484–492, 2020.
2020
-
[40]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
2021
-
[41]
Empathy in virtual agents: How emotional expressions can influence user perception
Sebastian Rings, Susanne Schmidt, Julia Janßen, Nale Lehmann-Willenbrock, Frank Steinicke, S Hasegawa, N Sakata, and V Sundstedt. Empathy in virtual agents: How emotional expressions can influence user perception. ICAT-EGVE, 2024.
2024
-
[42]
Empathetic conversational agents: Utilizing neural and physiological signals for enhanced empathetic interactions
Nastaran Saffaryazdi, Tamil Selvan Gunasekaran, Kate Loveys, Elizabeth Broadbent, and Mark Billinghurst. Empathetic conversational agents: Utilizing neural and physiological signals for enhanced empathetic interactions. International Journal of Human–Computer Interaction, pages 1–25, 2025.
2025
-
[43]
Learning dynamic facial radiance fields for few-shot talking head synthesis
Shuai Shen, Wanhua Li, Zheng Zhu, Yueqi Duan, Jie Zhou, and Jiwen Lu. Learning dynamic facial radiance fields for few-shot talking head synthesis. In European Conference on Computer Vision, pages 666–682. Springer, 2022.
2022
-
[44]
Difftalk: Crafting diffusion models for generalized audio-driven portraits animation
Shuai Shen, Wenliang Zhao, Zibin Meng, Wanhua Li, Zheng Zhu, Jie Zhou, and Jiwen Lu. Difftalk: Crafting diffusion models for generalized audio-driven portraits animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1982–1991, 2023.
2023
-
[45]
Emotalker: Audio driven emotion aware talking head generation
Xiaoqian Shen, Faizan Farooq Khan, and Mohamed Elhoseiny. Emotalker: Audio driven emotion aware talking head generation. In Proceedings of the Asian Conference on Computer Vision, pages 1900–1917, 2024.
2024
-
[46]
Emohead: Emotional talking head via manipulating semantic expression parameters
Xuli Shen, Hua Cai, Dingding Yu, Weilin Shen, Qing Xu, and Xiangyang Xue. Emohead: Emotional talking head via manipulating semantic expression parameters. arXiv preprint arXiv:2503.19416, 2025.
2025
-
[47]
Continuously controllable facial expression editing in talking face videos
Zhiyao Sun, Yu-Hui Wen, Tian Lv, Yanan Sun, Ziyang Zhang, Yaoyuan Wang, and Yong-Jin Liu. Continuously controllable facial expression editing in talking face videos. IEEE Transactions on Affective Computing, 15(3):1400–1413, 2023.
2023
-
[48]
Fg- emotalk: Talking head video generation with fine-grained controllable facial expressions
Zhaoxu Sun, Yuze Xuan, Fang Liu, and Yang Xiang. Fg- emotalk: Talking head video generation with fine-grained controllable facial expressions. InProceedings of the AAAI Conference on Artificial Intelligence, pages 5043–5051,
-
[49]
Emmn: Emotional motion memory network for audio-driven emotional talking face generation
Shuai Tan, Bin Ji, and Ye Pan. Emmn: Emotional motion memory network for audio-driven emotional talking face generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22146–22156, 2023.
2023
-
[50]
Edtalk: Efficient disentanglement for emotional talking head synthesis
Shuai Tan, Bin Ji, Mengxiao Bi, and Ye Pan. Edtalk: Efficient disentanglement for emotional talking head synthesis. In European Conference on Computer Vision, pages 398–. Springer, 2024.
2024
-
[52]
Style2talker: High-resolution talking head generation with emotion style and art style
Shuai Tan, Bin Ji, and Ye Pan. Style2talker: High-resolution talking head generation with emotion style and art style. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5079–5087, 2024.
2024
-
[53]
Gemini: A Family of Highly Capable Multimodal Models
Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
2023
-
[54]
Neural voice puppetry: Audio-driven facial reenactment
Justus Thies, Mohamed Elgharib, Ayush Tewari, Christian Theobalt, and Matthias Nießner. Neural voice puppetry: Audio-driven facial reenactment. In European Conference on Computer Vision, pages 716–731. Springer, 2020.
2020
-
[55]
Emo: Emote portrait alive generating expressive portrait videos with audio2video diffusion model under weak conditions
Linrui Tian, Qi Wang, Bang Zhang, and Liefeng Bo. Emo: Emote portrait alive generating expressive portrait videos with audio2video diffusion model under weak conditions. In European Conference on Computer Vision, pages 244–260. Springer, 2024.
2024
-
[56]
Towards Accurate Generative Models of Video: A New Metric & Challenges
Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018.
2018
-
[57]
Attention is all you need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
2017
-
[58]
Progressive disentangled representation learning for fine-grained controllable talking head synthesis
Duomin Wang, Yu Deng, Zixin Yin, Heung-Yeung Shum, and Baoyuan Wang. Progressive disentangled representation learning for fine-grained controllable talking head synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17979–17989, 2023.
2023
-
[59]
Eat-face: Emotion-controllable audio-driven talking face generation via diffusion model
Haodi Wang, Xiaojun Jia, and Xiaochun Cao. Eat-face: Emotion-controllable audio-driven talking face generation via diffusion model. In 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG), pages 1–10. IEEE, 2024.
2024
-
[60]
Emotivetalk: Expressive talking head generation through audio information decoupling and emotional video diffusion
Haotian Wang, Yuzhe Weng, Yueyan Li, Zilu Guo, Jun Du, Shutong Niu, Jiefeng Ma, Shan He, Xiaoyan Wu, Qiming Hu, et al. Emotivetalk: Expressive talking head generation through audio information decoupling and emotional video diffusion. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26212–26221, 2025.
2025
-
[61]
Lipformer: High-fidelity and generalizable talking face generation with a pre-learned facial codebook
Jiayu Wang, Kang Zhao, Shiwei Zhang, Yingya Zhang, Yujun Shen, Deli Zhao, and Jingren Zhou. Lipformer: High-fidelity and generalizable talking face generation with a pre-learned facial codebook. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13844–13853, 2023.
2023
-
[62]
Mead: A large-scale audio-visual dataset for emotional talking-face generation
Kaisiyuan Wang, Qianyi Wu, Linsen Song, Zhuoqian Yang, Wayne Wu, Chen Qian, Ran He, Yu Qiao, and Chen Change Loy. Mead: A large-scale audio-visual dataset for emotional talking-face generation. In European Conference on Computer Vision, pages 700–717. Springer, 2020.
2020
-
[63]
Audio2head: Audio-driven one-shot talking-head generation with natural head motion
S Wang, L Li, Y Ding, C Fan, and X Yu. Audio2head: Audio-driven one-shot talking-head generation with natural head motion. In International Joint Conference on Artificial Intelligence. IJCAI, 2021.
2021
-
[64]
One-shot talking face generation from single-speaker audio-visual correlation learning
Suzhen Wang, Lincheng Li, Yu Ding, and Xin Yu. One-shot talking face generation from single-speaker audio-visual correlation learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2531–2539, 2022.
2022
-
[65]
Vasa-1: Lifelike audio-driven talking faces generated in real time
Sicheng Xu, Guojun Chen, Yu-Xiao Guo, Jiaolong Yang, Chong Li, Zhenyu Zang, Yizhong Zhang, Xin Tong, and Baining Guo. Vasa-1: Lifelike audio-driven talking faces generated in real time. Advances in Neural Information Processing Systems, 37:660–684, 2024.
2024
-
[66]
Face2faceρ: Real-time high-resolution one-shot face reenactment
Kewei Yang, Kang Chen, Daoliang Guo, Song-Hai Zhang, Yuan-Chen Guo, and Weidong Zhang. Face2faceρ: Real-time high-resolution one-shot face reenactment. In European Conference on Computer Vision, pages 55–71. Springer, 2022.
2022
-
[67]
Styleheat: One-shot high-resolution editable talking face generation via pre-trained stylegan
Fei Yin, Yong Zhang, Xiaodong Cun, Mingdeng Cao, Yanbo Fan, Xuan Wang, Qingyan Bai, Baoyuan Wu, Jue Wang, and Yujiu Yang. Styleheat: One-shot high-resolution editable talking face generation via pre-trained stylegan. In European Conference on Computer Vision, pages 85–101. Springer, 2022.
2022
-
[68]
Talking head generation with probabilistic audio-to-visual diffusion priors
Zhentao Yu, Zixin Yin, Deyu Zhou, Duomin Wang, Finn Wong, and Baoyuan Wang. Talking head generation with probabilistic audio-to-visual diffusion priors. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7645–7655, 2023.
2023
-
[69]
Few-shot adversarial learning of realistic neural talking head models
Egor Zakharov, Aliaksandra Shysheya, Egor Burkov, and Victor Lempitsky. Few-shot adversarial learning of realistic neural talking head models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9459–9468, 2019.
2019
-
[70]
Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset
Zhimeng Zhang, Lincheng Li, Yu Ding, and Changjie Fan. Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3661–3670, 2021.
2021
-
[71]
Identity-preserving talking face generation with landmark and appearance priors
Weizhi Zhong, Chaowei Fang, Yinqi Cai, Pengxu Wei, Gangming Zhao, Liang Lin, and Guanbin Li. Identity-preserving talking face generation with landmark and appearance priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2023.
2023
-
[72]
Talking face generation by adversarially disentangled audio-visual representation
Hang Zhou, Yu Liu, Ziwei Liu, Ping Luo, and Xiaogang Wang. Talking face generation by adversarially disentangled audio-visual representation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 9299–9306, 2019.
2019
-
[73]
Pose-controllable talking face generation by implicitly modularized audio-visual representation
Hang Zhou, Yasheng Sun, Wayne Wu, Chen Change Loy, Xiaogang Wang, and Ziwei Liu. Pose-controllable talking face generation by implicitly modularized audio-visual representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4176–4186,
-
[74]
MakeItTalk: Speaker-aware talking-head animation
Yang Zhou, Xintong Han, Eli Shechtman, Jose Echevarria, Evangelos Kalogerakis, and Dingzeyu Li. MakeItTalk: speaker-aware talking-head animation. ACM Transactions on Graphics (TOG), 39(6):1–15, 2020.
2020
discussion (0)