pith. machine review for the scientific record.

arxiv: 2604.14541 · v1 · submitted 2026-04-16 · 💻 cs.CV

Recognition: unknown

Giving Faces Their Feelings Back: Explicit Emotion Control for Feedforward Single-Image 3D Head Avatars

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 11:45 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D head avatars · emotion control · single-image reconstruction · feedforward networks · disentangled manipulation · facial animation · emotion transfer · modulation mechanism

The pith

Emotion can be treated as an independent control signal in single-image 3D head avatars.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a method for controlling emotion independently in 3D head avatars reconstructed from a single photo, rather than leaving it entangled with shape or appearance. It adds this control to existing feed-forward models through a dual-path modulation that adjusts geometry via emotion-conditioned normalization and appearance via identity-aware emotion cues. A supporting dataset is built by transferring aligned emotional dynamics across identities, keeping timing and emotion labels consistent. The approach preserves the original reconstruction quality while enabling emotion transfer, manipulation of emotion apart from speech-driven motion, and smooth interpolation between emotional states.

Core claim

We present a framework for explicit emotion control in feed-forward, single-image 3D head avatar reconstruction. Unlike existing pipelines where emotion is implicitly entangled with geometry or appearance, we treat emotion as a first-class control signal that can be manipulated independently and consistently across identities. Our method injects emotion into existing feed-forward architectures via a dual-path modulation mechanism without modifying their core design. Geometry modulation performs emotion-conditioned normalization in the original parametric space, disentangling emotional state from speech-driven articulation, while appearance modulation captures identity-aware, emotion-dependent visual cues beyond geometry.

What carries the argument

Dual-path modulation mechanism that separates geometry modulation (emotion-conditioned normalization in parametric space) from appearance modulation (identity-aware emotion-dependent cues).
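
As a concrete reading of that mechanism, the sketch below shows how a dual-path modulation could wrap a frozen backbone: geometry is modulated by a FiLM-style, emotion-conditioned normalization of the parametric expression coefficients, and appearance by an additive, identity-aware emotion cue. The module names, dimensions, and FiLM formulation are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch of dual-path emotion modulation (assumed design).
import torch
import torch.nn as nn

class GeometryModulation(nn.Module):
    """Emotion-conditioned normalization in the parametric (e.g. FLAME) expression space."""
    def __init__(self, expr_dim=100, emo_dim=32):
        super().__init__()
        self.norm = nn.LayerNorm(expr_dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(emo_dim, 2 * expr_dim)

    def forward(self, expr_params, emo_embed):
        # Normalize the speech-driven articulation, then re-inject emotion
        # as a learned scale/shift so the two factors remain separable.
        scale, shift = self.to_scale_shift(emo_embed).chunk(2, dim=-1)
        return self.norm(expr_params) * (1 + scale) + shift

class AppearanceModulation(nn.Module):
    """Identity-aware, emotion-dependent cue added to the backbone's appearance features."""
    def __init__(self, feat_dim=256, emo_dim=32, id_dim=64):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(emo_dim + id_dim, feat_dim), nn.SiLU(),
            nn.Linear(feat_dim, feat_dim))

    def forward(self, feats, emo_embed, id_embed):
        cue = self.fuse(torch.cat([emo_embed, id_embed], dim=-1))
        return feats + cue  # additive, so the backbone itself is unchanged

# Illustrative usage with placeholder shapes.
geo, app = GeometryModulation(), AppearanceModulation()
expr = torch.randn(4, 100)    # speech-driven expression parameters
feats = torch.randn(4, 256)   # backbone appearance features
emo = torch.randn(4, 32)      # emotion embedding (the control signal)
iden = torch.randn(4, 64)     # identity embedding from the input image
expr_mod, feats_mod = geo(expr, emo), app(feats, emo, iden)
```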

If this is right

  • Existing feed-forward 3D head avatar architectures gain emotion control without changes to their core design.
  • Emotion transfer becomes controllable and consistent across different identities.
  • Emotional state can be disentangled from speech-driven articulation for separate manipulation.
  • Smooth interpolation between emotional states is supported while preserving reconstruction fidelity.
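
The interpolation point has a simple reading: because emotion enters as an embedding, blending two embeddings gives a continuous control signal even though training labels are discrete. The snippet below is an illustrative linear blend; whether the paper interpolates embeddings, logits, or something else is an assumption here.

```python
# Illustrative emotion interpolation between two placeholder embeddings.
import torch

def interpolate_emotions(embed_a, embed_b, steps=5):
    """Return `steps` blended emotion embeddings going from A to B."""
    weights = torch.linspace(0.0, 1.0, steps).unsqueeze(-1)  # (steps, 1)
    return (1 - weights) * embed_a + weights * embed_b       # (steps, dim)

happy = torch.randn(32)  # placeholder embedding for "happy"
sad = torch.randn(32)    # placeholder embedding for "sad"
blend = interpolate_emotions(happy, sad, steps=7)  # one embedding per rendered frame
```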

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This separation could support real-time emotion editing in virtual meetings or games using single photos.
  • The dataset construction technique might apply to other disentanglement tasks like age or lighting control.
  • Extending the modulation to video inputs could enable dynamic emotion sequences beyond static images.

Load-bearing premise

That emotional dynamics can be transferred and aligned across different identities to create a time-synchronized dataset without artifacts or identity leakage.
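
One way to picture the premise: if emotion shows up as a frame-aligned residual on top of a neutral performance, that residual can be lifted from an anchor subject and re-applied to any other identity speaking the same utterance, preserving timing by construction. The additive decomposition below is an assumption for illustration, not the paper's curation recipe.

```python
# Hypothetical cross-identity transfer of frame-aligned emotional dynamics.
import numpy as np

def transfer_emotion_dynamics(anchor_emotional, anchor_neutral, target_neutral):
    """Re-apply an anchor subject's per-frame emotion offsets to a target identity.

    anchor_emotional: (T, D) anchor's expression parameters with emotion
    anchor_neutral:   (T, D) the same utterance by the anchor, neutral
    target_neutral:   (T, D) the target identity, same utterance, neutral
    Returns (T, D) emotionalized parameters for the target, frame-aligned.
    """
    emotion_offsets = anchor_emotional - anchor_neutral  # timing preserved per frame
    return target_neutral + emotion_offsets

# Toy usage with synthetic parameter tracks (T=120 frames, D=50 coefficients).
rng = np.random.default_rng(0)
anchor_neu = rng.normal(size=(120, 50))
anchor_emo = anchor_neu + 0.3 * np.sin(np.linspace(0, 6, 120))[:, None]
target_neu = rng.normal(size=(120, 50))
target_emo = transfer_emotion_dynamics(anchor_emo, anchor_neu, target_neu)
assert target_emo.shape == (120, 50)
```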

What would settle it

Apply the same emotion sequence to multiple identities in the dataset and check whether the outputs show consistent timing, no visible artifacts, and no identity mixing.
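
Such a check could be scored with any off-the-shelf emotion classifier and face-recognition embedder; the two metrics below (timing consistency of the target-emotion probability, and identity preservation against the source photo) are one hedged way to make it quantitative, not a protocol from the paper.

```python
# Hedged sketch of the consistency check across identities.
import numpy as np

def timing_consistency(emotion_probs_per_identity):
    """Mean pairwise correlation of the target-emotion probability over time."""
    tracks = np.stack(emotion_probs_per_identity)  # (N identities, T frames)
    corrs = []
    for i in range(len(tracks)):
        for j in range(i + 1, len(tracks)):
            corrs.append(np.corrcoef(tracks[i], tracks[j])[0, 1])
    return float(np.mean(corrs))

def identity_preservation(source_embeds, rendered_embeds):
    """Mean cosine similarity between each source face and its rendered avatar."""
    s = source_embeds / np.linalg.norm(source_embeds, axis=1, keepdims=True)
    r = rendered_embeds / np.linalg.norm(rendered_embeds, axis=1, keepdims=True)
    return float(np.mean(np.sum(s * r, axis=1)))

# Toy check: three identities with identically timed emotion curves.
t = np.linspace(0, 1, 60)
probs = [0.5 + 0.4 * np.sin(2 * np.pi * t) for _ in range(3)]
print(timing_consistency(probs))  # ~1.0 when timing is perfectly aligned
```

High timing consistency together with high identity preservation would support the premise; diverging timing or low identity similarity would indicate misalignment or identity leakage in the transferred dataset.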

Figures

Figures reproduced from arXiv: 2604.14541 by Hao Pan, Hao Zhu, Jiahao Li, Jiawei Zhang, Lei Chu, Liqiang Liu, Yan Lu, Yanwen Wang, Yicheng Gong.

Figure 1
Figure 1: Emotion-Interpolated 3D Head Avatars. Given a single image, our feed-forward framework reconstructs expressive 3D avatars with explicit and interpolatable emotion control. The same blended emotion exhibits identity-aware variations, and all results are produced in a single forward pass without per-identity optimization. view at source ↗
Figure 2
Figure 2: Overview of the proposed framework. The system consists of (left) emotion-consistent data curation and (right) dual-path emotion-aware modulation. We build a time-synchronized multi-identity dataset by transferring frame-aligned emotional dynamics from anchor subjects, yielding explicit emotion supervision disentangled from speech and identity. The reconstruction network modulates geometry and appearance … view at source ↗
Figure 3
Figure 3: Reconstruction and reenactment on a held-out synthetic dataset. Top: self-identity reenactment using the subject's own motion. Bottom: cross-identity reenactment driven by another subject. Compared to the original baseline, emotion-aware modulation introduces no degradation in generation quality, consistently maintaining the same level of reconstruction fidelity and driving accuracy across settings and ba… view at source ↗
Figure 4
Figure 4: Emotion transfer with explicit emotion control. Rows show four identities (top: synthetic with GT; others: real), and columns compare methods. The target emotion is indicated on the left. Our approach enforces the specified emotion independent of source appearance or motion, yielding identity-consistent results. view at source ↗
Figure 5
Figure 5: Ablation of dual-path emotion modulation. Without appearance modulation, texture-level emotion leaks from the reference; without geometry modulation, deformation follows the driving motion. The full model suppresses both and matches the target emotion. Additionally, geometry reuse across backbones achieves comparable results, highlighting its robust transferability. view at source ↗
Figure 6
Figure 6: Emotion control across identities under fixed geometry. Rows: identities; columns: seven target emotions. All results share the same driving FLAME sequence. Identities respond differently to the same emotion, while each identity exhibits consistent emotion-specific variation. view at source ↗
Figure 7
Figure 7: Motion robustness under fixed emotion. Rows fix identity and emotion; columns vary driving FLAME motion. Emotion remains consistent across articulations while preserving identity. view at source ↗
Figure 8
Figure 8: Continuous emotion interpolation. For a fixed identity, rows interpolate between happy–neutral–sad and disgust–fear–angry. Despite discrete training labels, embedding interpolation yields smooth geometry and appearance transitions. FLAME geometry is shown in the inset. view at source ↗
read the original abstract

We present a framework for explicit emotion control in feed-forward, single-image 3D head avatar reconstruction. Unlike existing pipelines where emotion is implicitly entangled with geometry or appearance, we treat emotion as a first-class control signal that can be manipulated independently and consistently across identities. Our method injects emotion into existing feed-forward architectures via a dual-path modulation mechanism without modifying their core design. Geometry modulation performs emotion-conditioned normalization in the original parametric space, disentangling emotional state from speech-driven articulation, while appearance modulation captures identity-aware, emotion-dependent visual cues beyond geometry. To enable learning under this setting, we construct a time-synchronized, emotion-consistent multi-identity dataset by transferring aligned emotional dynamics across identities. Integrated into multiple state-of-the-art backbones, our framework preserves reconstruction and reenactment fidelity while enabling controllable emotion transfer, disentangled manipulation, and smooth emotion interpolation, advancing expressive and scalable 3D head avatars.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents a framework for explicit emotion control in feed-forward single-image 3D head avatar reconstruction. It treats emotion as a first-class independent control signal and injects it into existing architectures via a dual-path modulation mechanism (geometry modulation through emotion-conditioned normalization in parametric space to disentangle from speech-driven articulation, plus appearance modulation for identity-aware emotion cues) without altering the core backbone design. A key enabler is the construction of a time-synchronized, emotion-consistent multi-identity dataset via transfer of aligned emotional dynamics across identities. The method is claimed to preserve reconstruction/reenactment fidelity while supporting controllable emotion transfer, disentangled manipulation, and smooth interpolation.

Significance. If the central claims hold after validation, the work would be a meaningful contribution to 3D head avatar research by enabling practical, backbone-agnostic addition of explicit emotion control. The dual-path design and cross-identity dataset construction address a real entanglement issue in current pipelines. Credit is due for the emphasis on integration without core modifications and the focus on consistency across identities, which could improve scalability of expressive avatars.

major comments (2)
  1. [Abstract / Dataset Construction] Dataset construction (as described in the abstract): the central claim that emotion can be manipulated independently and consistently across identities rests on the transferred multi-identity corpus preserving disentanglement. No quantitative validation is supplied (e.g., identity classification accuracy on neutral frames, emotion consistency scores across transferred sequences, or metrics for misalignment/artifacts), which directly risks the dual-path modulation learning correlated rather than independent factors.
  2. [Abstract / Evaluation] Evaluation and results sections: the abstract asserts that the framework 'preserves reconstruction and reenactment fidelity' and enables 'controllable emotion transfer' when integrated into multiple state-of-the-art backbones, yet no quantitative results, ablation studies, baseline comparisons, or implementation details (losses, training procedure, modulation equations) are provided. This makes it impossible to assess whether the geometry normalization truly disentangles emotional state from articulation or introduces artifacts.
minor comments (1)
  1. [Title / Abstract] The title's phrase 'giving faces their feelings back' is never connected to how prior work handles emotion implicitly; a brief positioning sentence in the abstract would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential contribution of the dual-path modulation approach and the cross-identity dataset construction. We address each major comment below, providing clarifications and committing to revisions where the manuscript can be strengthened without misrepresenting the presented work.

read point-by-point responses
  1. Referee: [Abstract / Dataset Construction] Dataset construction (as described in the abstract): the central claim that emotion can be manipulated independently and consistently across identities rests on the transferred multi-identity corpus preserving disentanglement. No quantitative validation is supplied (e.g., identity classification accuracy on neutral frames, emotion consistency scores across transferred sequences, or metrics for misalignment/artifacts), which directly risks the dual-path modulation learning correlated rather than independent factors.

    Authors: We agree that explicit quantitative validation of the transferred dataset would better support the disentanglement claim. The manuscript describes the construction via transfer of aligned emotional dynamics to maintain time-synchronization and emotion consistency across identities, but does not report the suggested metrics. In the revised version we will add quantitative evaluations, including identity classification accuracy on neutral frames, emotion consistency scores across transferred sequences, and misalignment metrics, to demonstrate that the corpus preserves independent factors. revision: yes

  2. Referee: [Abstract / Evaluation] Evaluation and results sections: the abstract asserts that the framework 'preserves reconstruction and reenactment fidelity' and enables 'controllable emotion transfer' when integrated into multiple state-of-the-art backbones, yet no quantitative results, ablation studies, baseline comparisons, or implementation details (losses, training procedure, modulation equations) are provided. This makes it impossible to assess whether the geometry normalization truly disentangles emotional state from articulation or introduces artifacts.

    Authors: The abstract summarizes the claims at a high level. The full manuscript presents qualitative results across multiple backbones showing fidelity preservation and controllable transfer, along with the dual-path design rationale. However, we acknowledge that additional quantitative support would allow a more rigorous assessment of disentanglement. In the revision we will expand the evaluation section with quantitative metrics (e.g., reconstruction error, reenactment fidelity scores), ablation studies on each modulation path, baseline comparisons, and explicit details on losses, training procedure, and modulation equations. revision: yes

Circularity Check

0 steps flagged

No significant circularity; extends external backbones with additive modulation

full rationale

The paper's core derivation introduces a dual-path modulation (geometry normalization in parametric space plus appearance modulation) into existing feed-forward single-image 3D head avatar architectures without altering their core design. Dataset construction via cross-identity emotion transfer is presented as an enabling preprocessing step rather than a fitted or self-derived quantity. No equations, predictions, or uniqueness claims reduce to self-definition, fitted inputs renamed as outputs, or load-bearing self-citations. The framework is explicitly integrated into multiple state-of-the-art external backbones, preserving fidelity while adding controllability, rendering the chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the separability of emotion from geometry and appearance plus the feasibility of cross-identity emotion transfer; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: Emotion can be treated as an independent control signal separable from identity, speech articulation, and appearance in 3D head models.
    Invoked as the basis for the dual-path modulation mechanism and dataset construction.

pith-pipeline@v0.9.0 · 5483 in / 1212 out tokens · 28207 ms · 2026-05-10T11:45:10.959350+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

103 extracted references · 3 canonical work pages

  1. [1]

    In: CVPR

    Abdal, R., Lee, H.Y., Zhu, P., Chai, M., Siarohin, A., Wonka, P., Tulyakov, S.: 3davatargan: Bridging domains for personalized editable avatars. In: CVPR. pp. 4552--4562 (June 2023)

  2. [2]

    In: CVPR

    Abdal, R., Yifan, W., Shi, Z., Xu, Y., Po, R., Kuang, Z., Chen, Q., Yeung, D.Y., Wetzstein, G.: Gaussian shell maps for efficient 3d human generation. In: CVPR. pp. 9441--9451 (June 2024)

  3. [3]

    In: ICCV (2025)

    Aneja, S., Sevastopolsky, A., Kirschstein, T., Thies, J., Dai, A., Nießner, M.: Gaussianspeech: Audio-driven gaussian avatars. In: ICCV (2025)

  4. [4]

    In: WACV

    Bhattarai, A.R., Nießner, M., Sevastopolsky, A.: Triplanenet: An encoder for eg3d inversion. In: WACV. pp. 3055--3065 (2024)

  5. [5]

    In: SIGGRAPH

    Blanz, V., Vetter, T.: A morphable model for the synthesis of 3d faces. In: SIGGRAPH. pp. 187--194. ACM Press (1999)

  6. [6]

    In: CVPR (2021)

    Buehler, M.C., Meka, A., Li, G., Beeler, T., Hilliges, O.: Varitex: Variational neural face textures. In: CVPR (2021)

  7. [7]

    In: ICCV

    B \"u hler, M.C., Sarkar, K., Shah, T., Li, G., Wang, D., Helminger, L., Orts-Escolano, S., Lagun, D., Hilliges, O., Beeler, T., et al.: Preface: A data-driven volumetric prior for few-shot ultra high-resolution face synthesis. In: ICCV. pp. 3402--3413 (2023)

  8. [8]

    ACM TOG 41(4) (Jul 2022)

    Cao, C., Simon, T., Kim, J.K., Schwartz, G., Zollhoefer, M., Saito, S.S., Lombardi, S., Wei, S.E., Belko, D., Yu, S.I., Sheikh, Y., Saragih, J.: Authentic volumetric avatars from a phone scan. ACM TOG 41(4) (Jul 2022)

  9. [9]

    IEEE TVCG 20(3), 413--425 (2014)

    Cao, C., Weng, Y., Zhou, S., Tong, Y., Zhou, K.: Facewarehouse: A 3d facial expression database for visual computing. IEEE TVCG 20(3), 413--425 (2014)

  10. [10]

    In: CVPR (2021)

    Chan, E., Monteiro, M., Kellnhofer, P., Wu, J., Wetzstein, G.: pi-gan: Periodic implicit generative adversarial networks for 3d-aware image synthesis. In: CVPR (2021)

  11. [11]

    In: CVPR (2022)

    Chan, E.R., Lin, C.Z., Chan, M.A., Nagano, K., Pan, B., Mello, S.D., Gallo, O., Guibas, L., Tremblay, J., Khamis, S., Karras, T., Wetzstein, G.: Efficient geometry-aware 3D generative adversarial networks. In: CVPR (2022)

  12. [12]

    In: SIGGRAPH Conference Papers

    Chen, Y., Wang, L., Li, Q., Xiao, H., Zhang, S., Yao, H., Liu, Y.: Monogaussianavatar: Monocular gaussian point-based head avatar. In: SIGGRAPH Conference Papers. pp. 1--9 (2024)

  13. [13]

    In: NeurIPS

    Chu, X., Harada, T.: Generalizable and animatable gaussian head avatar. In: NeurIPS. vol. 37, pp. 57642--57670 (2024)

  14. [14]

    In: ICLR (2024)

    Chu, X., Li, Y., Zeng, A., Yang, T., Lin, L., Liu, Y., Harada, T.: Gpavatar: Generalizable and precise head avatar from image(s). In: ICLR (2024)

  15. [15]

    In: SIGGRAPH Asia Conference Papers

    Cui, J., Chen, Y., Xu, M., Shang, H., Chen, Y., Zhan, Y., Dong, Z., Yao, Y., Wang, J., Zhu, S.: Hallo4: High-fidelity dynamic portrait animation via direct preference optimization and temporal motion modulation. In: SIGGRAPH Asia Conference Papers. ACM (2025)

  16. [16]

    In: ICLR (2025)

    Cui, J., Li, H., Yao, Y., Zhu, H., Shang, H., Cheng, K., Zhou, H., Zhu, S., Wang, J.: Hallo2: Long-duration and high-resolution audio-driven portrait image animation. In: ICLR (2025)

  17. [17]

    In: CVPR (2025)

    Cui, J., Li, H., Zhan, Y., Shang, H., Cheng, K., Ma, Y., Mu, S., Zhou, H., Wang, J., Zhu, S.: Hallo3: Highly dynamic and realistic portrait image animation with video diffusion transformer. In: CVPR (2025)

  18. [18]

    In: CVPR

    Deng, Y., Wang, D., Ren, X., Chen, X., Wang, B.: Portrait4d: Learning one-shot 4d head avatar synthesis using synthetic data. In: CVPR. pp. 7119--7130 (2024)

  19. [19]

    In: ECCV (2024)

    Deng, Y., Wang, D., Wang, B.: Portrait4d-v2: Pseudo multi-view data creates better 4d head synthesizer. In: ECCV (2024)

  20. [20]

    In: ECCV (2024)

    Dhamo, H., Nie, Y., Moreau, A., Song, J., Shaw, R., Zhou, Y., Pérez-Pellitero, E.: Headgas: Real-time animatable head avatars via 3d gaussian splatting. In: ECCV (2024)

  21. [21]

    In: CVPR (2022)

    Fan, Y., Lin, Z., Saito, J., Wang, W., Komura, T.: Faceformer: Speech-driven 3d facial animation with transformers. In: CVPR (2022)

  22. [22]

    In: SIGGRAPH Asia Conference Papers (2024)

    Gao, X., Xiao, H., Zhong, C., Hu, S., Guo, Y., Zhang, J.: Portrait video editing empowered by multimodal generative priors. In: SIGGRAPH Asia Conference Papers (2024)

  23. [23]

    ACM TOG 41(6) (2022)

    Gao, X., Zhong, C., Xiang, J., Hong, Y., Guo, Y., Zhang, J.: Reconstructing personalized semantic facial nerf models from monocular video. ACM TOG 41(6) (2022)

  24. [24]

    In: SIGGRAPH Asia Conference Papers (2025)

    Gao, X., Zhou, J., Liu, D., Zhou, Y., Zhang, J.: Constructing diffusion avatar with learnable embeddings. In: SIGGRAPH Asia Conference Papers (2025)

  25. [25]

    In: SIGGRAPH Asia Conference Papers (2024)

    Giebenhain, S., Kirschstein, T., Rünz, M., Agapito, L., Nießner, M.: Npga: Neural parametric gaussian avatars. In: SIGGRAPH Asia Conference Papers (2024)

  26. [26]

    In: CVPR (2025)

    Gu, Y., Tran, P., Zheng, Y., Xu, H., Li, H., Karmanov, A., Li, H.: Diffportrait360: Consistent portrait diffusion for 360 view synthesis. In: CVPR (2025)

  27. [27]

    In: ICMI (2023)

    Haque, K.I., Yumak, Z.: Facexhubert: Text-less speech-driven e(x)pressive 3d facial animation synthesis using self-supervised speech representation learning. In: ICMI (2023)

  28. [28]

    ACM TOG 44(4), 1--12 (2025)

    He, C., Li, J., Kirschstein, T., Sevastopolsky, A., Saito, S., Tan, Q., Romero, J., Cao, C., Rushmeier, H., Nam, G.: 3dgh: 3d head generation with composable hair and face. ACM TOG 44(4), 1--12 (2025)

  29. [29]

    In: ECCV

    He, Q., Ji, X., Gong, Y., Lu, Y., Diao, Z., Huang, L., Yao, Y., Zhu, S., Ma, Z., Xu, S., et al.: Emotalk3d: High-fidelity free-view synthesis of emotional 3d talking head. In: ECCV. Springer (2024)

  30. [30]

    In: SIGGRAPH Conference Papers

    He, Y., Gu, X., Ye, X., Xu, C., Zhao, Z., Dong, Y., Yuan, W., Dong, Z., Bo, L.: Lam: Large avatar model for one-shot animatable gaussian head. In: SIGGRAPH Conference Papers. pp. 1--13 (2025)

  31. [31]

    In: CVPR

    Hong, F.T., Zhang, L., Shen, L., Xu, D.: Depth-aware generative adversarial network for talking head video generation. In: CVPR. pp. 3397--3406 (2022)

  32. [32]

    In: CVPR

    Hong, Y., Peng, B., Xiao, H., Liu, L., Zhang, J.: Headnerf: A real-time nerf-based parametric head model. In: CVPR. pp. 20374--20384 (2022)

  33. [33]

    In: ICLR (2026)

    Ji, X., Weiss, S., Kansy, M., Naruniec, J., Cao, X., Solenthaler, B., Bradley, D.: Fast GHA : Generalized few-shot 3d gaussian head avatars with real-time animation. In: ICLR (2026)

  34. [34]

    In: SIGGRAPH Conference Papers

    Ji, X., Zhou, H., Wang, K., Wu, Q., Wu, W., Xu, F., Cao, X.: Eamm: One-shot emotional talking face via audio-based emotion-aware motion model. In: SIGGRAPH Conference Papers. SIGGRAPH '22 (2022)

  35. [35]

    In: CVPR

    Ji, X., Zhou, H., Wang, K., Wu, W., Loy, C.C., Cao, X., Xu, F.: Audio-driven emotional video portraits. In: CVPR. pp. 15480--15489 (2021)

  36. [36]

    ACM TOG 42(4), 139--1 (2023)

    Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for real-time radiance field rendering. ACM TOG 42(4), 139--1 (2023)

  37. [37]

    In: CVPR (2026)

    Kirschstein, T., Giebenhain, S., Nießner, M.: Flexavatar: Learning complete 3d head avatars with partial supervision. In: CVPR (2026)

  38. [38]

    In: SIGGRAPH Asia Conference Papers

    Kirschstein, T., Giebenhain, S., Tang, J., Georgopoulos, M., Nießner, M.: GGHead: Fast and Generalizable 3D Gaussian Heads. In: SIGGRAPH Asia Conference Papers. SA '24, Association for Computing Machinery, New York, NY, USA (2024)

  39. [39]

    ACM TOG 42(4) (jul 2023)

    Kirschstein, T., Qian, S., Giebenhain, S., Walter, T., Nießner, M.: Nersemble: Multi-view radiance field reconstruction of human heads. ACM TOG 42(4) (Jul 2023)

  40. [40]

    In: ICCV (2025)

    Kirschstein, T., Romero, J., Sevastopolsky, A., Nießner, M., Saito, S.: Avat3r: Large animatable gaussian reconstruction model for high-fidelity 3d head avatars. In: ICCV (2025)

  41. [41]

    In: ECCV (2024)

    Li, H., Chen, C., Shi, T., Qiu, Y., An, S., Chen, G., Han, X.: Spherehead: Stable 3d full-head synthesis with spherical tri-plane representation. In: ECCV (2024)

  42. [42]

    In: NeurIPS (2025)

    Li, H., Liu, K., Qiu, L., Zuo, Q., Zheng, K., Dong, Z., Han, X.: Hyplanehead: Rethinking tri-plane-like representations in full-head image synthesis. In: NeurIPS (2025)

  43. [43]

    In: ICLR (2026)

    Li, H., Zhang, H., Qiu, Y., Sun, Z., Zheng, K., Qiu, L., Li, P., Zuo, Q., Chen, C., Zheng, Y., et al.: Condition matters in full-head 3d gans. In: ICLR (2026)

  44. [44]

    In: CVPR

    Li, L., Li, Y., Weng, Y., Zheng, Y., Zhou, K.: Rgbavatar: Reduced gaussian blendshapes for online modeling of head avatars. In: CVPR. pp. 10747--10757 (June 2025)

  45. [45]

    ACM TOG 36(6), 194--1 (2017)

    Li, T., Bolkart, T., Black, M.J., Li, H., Romero, J.: Learning a model of facial shape and expression from 4d scans. ACM TOG 36(6), 194--1 (2017)

  46. [46]

    In: CVPR

    Li, W., Zhang, L., Wang, D., Zhao, B., Wang, Z., Chen, M., Zhang, B., Wang, Z., Bo, L., Li, X.: One-shot high-fidelity talking-head synthesis with deformable neural radiance field. In: CVPR. pp. 17969--17978 (2023)

  47. [47]

    In: ECCV

    Li, X., Cheng, Y., Ren, X., Jia, H., Xu, D., Zhu, W., Yan, Y.: Topo4d: Topology-preserving gaussian splatting for high-fidelity 4d head capture. In: ECCV. pp. 128--145. Springer (2024)

  48. [48]

    In: CVPR

    Li, X., Wang, J., Cheng, Y., Zeng, Y., Ren, X., Zhu, W., Zhao, W., Yan, Y.: Towards high-fidelity 3d talking avatar with personalized dynamic texture. In: CVPR. pp. 204--214 (2025)

  49. [49]

    In: NeurIPS

    Li, X., De Mello, S., Liu, S., Nagano, K., Iqbal, U., Kautz, J.: Generalizable one-shot 3d neural head avatar. In: NeurIPS. vol. 36 (2024)

  50. [50]

    In: CVPR (2026)

    Li, Z., Pun, C.M., Fang, C., Wang, J., Cun, X.: Personalive! expressive portrait image animation for live streaming. In: CVPR (2026)

  51. [51]

    In: CVPR (2026)

    Liu, C., Jing, T., Ma, C., Zhou, X., Lian, Z., Jin, Q., Yuan, H., Huang, S.S.: Emodifftalk: Emotion-aware diffusion for editable 3d gaussian talking head. In: CVPR (2026)

  52. [52]

    In: CVPR (2025)

    Liu, H., Wang, X., Wan, Z., Ma, Y., Chen, J., Fan, Y., Shen, Y., Song, Y., Chen, Q.: Avatarartist: Open-domain 4d avatarization. In: CVPR (2025)

  53. [53]

    In: SIGGRAPH Conference Papers

    Liu, H., Wang, X., Wan, Z., Shen, Y., Song, Y., Liao, J., Chen, Q.: Headartist: Text-conditioned 3d head generation with self score distillation. In: SIGGRAPH Conference Papers. SIGGRAPH '24, Association for Computing Machinery, New York, NY, USA (2024)

  54. [54]

    Ma, C., Tan, S., Pan, Y., Yang, J., Tong, X.: Esgaussianface: Emotional and stylized audio-driven facial animation via 3d gaussian splatting. TVCG pp. 1--12 (2026)

  55. [55]

    In: CVPR

    Ma, Z., Zhu, X., Qi, G.J., Lei, Z., Zhang, L.: Otavatar: One-shot talking face avatar with controllable tri-plane rendering. In: CVPR. pp. 16901--16910 (2023)

  56. [56]

    In: NeurIPS Track on Datasets and Benchmarks (2024)

    Martinez, J., Kim, E., Romero, J., Bagautdinov, T., Saito, S., Yu, S.I., Anderson, S., Zollhöfer, M., Team, C.A.: Codec Avatar Studio: Paired Human Captures for Complete, Driveable, and Generalizable Avatars . In: NeurIPS Track on Datasets and Benchmarks (2024)

  57. [57]

    In: ECCV (2020)

    Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. In: ECCV (2020)

  58. [58]

    arXiv preprint arXiv:2312.06400 (2023)

    Mir, A., Alonso, E., Mondragón, E.: Dit-head: High-resolution talking head synthesis using diffusion transformers. arXiv preprint arXiv:2312.06400 (2023)

  59. [59]

    In: CVPR (2022)

    Paraperas Papantoniou, F., Filntisis, P.P., Maragos, P., Roussos, A.: Neural emotion director: Speech-preserving semantic control of facial expressions in "in-the-wild" videos. In: CVPR (2022)

  60. [60]

    In: ACM MM

    Prajwal, K.R., Mukhopadhyay, R., Namboodiri, V.P., Jawahar, C.: A lip sync expert is all you need for speech to lip generation in the wild. In: ACM MM. p. 484–492. MM '20, Association for Computing Machinery, New York, NY, USA (2020)

  61. [61]

    In: CVPR

    Qian, S., Kirschstein, T., Schoneveld, L., Davoli, D., Giebenhain, S., Nießner, M.: Gaussianavatars: Photorealistic head avatars with rigged 3d gaussians. In: CVPR. pp. 20299--20309 (2024)

  62. [62]

    In: ICCV

    Richard, A., Zollhöfer, M., Wen, Y., de la Torre, F., Sheikh, Y.: Meshtalk: 3d face animation from speech using cross-modality disentanglement. In: ICCV. pp. 1173--1182 (October 2021)

  63. [63]

    In: ACCV

    Shen, X., Khan, F.F., Elhoseiny, M.: Emotalker: Audio driven emotion aware talking head generation. In: ACCV. pp. 1900--1917 (December 2024)

  64. [64]

    In: SIGGRAPH Conference Papers

    Song, L., Zhou, Y., Xu, Z., Zhou, Y., Aneja, D., Xu, C.: Streamme: Simplify 3d gaussian avatar within live stream. In: SIGGRAPH Conference Papers. SIGGRAPH '25, Association for Computing Machinery, New York, NY, USA (2025)

  65. [65]

    In: WACV

    Stypułkowski, M., Vougioukas, K., He, S., Zięba, M., Petridis, S., Pantic, M.: Diffused heads: Diffusion models beat gans on talking-face generation. In: WACV. pp. 5091--5100 (2024)

  66. [66]

    In: CVPR (2023)

    Sun, J., Wang, X., Wang, L., Li, X., Zhang, Y., Zhang, H., Liu, Y.: Next3d: Generative neural texture rasterization for 3d-aware head avatars. In: CVPR (2023)

  67. [67]

    ACM TOG 43(4) (2024)

    Sun, Z., Lv, T., Ye, S., Lin, M., Sheng, J., Wen, Y.H., Yu, M., Liu, Y.J.: Diffposetalk: Speech-driven stylistic 3d facial animation and head pose generation via diffusion models. ACM TOG 43(4) (2024)

  68. [68]

    In: ECCV

    Tan, S., Ji, B., Bi, M., Pan, Y.: Edtalk: Efficient disentanglement for emotional talking head synthesis. In: ECCV. pp. 398--416. Springer (2025)

  69. [69]

    In: ICCV

    Tan, S., Ji, B., Pan, Y.: Emmn: Emotional motion memory network for audio-driven emotional talking face generation. In: ICCV. pp. 22146--22156 (2023)

  70. [70]

    In: SIGGRAPH Asia Conference Papers

    Taubner, F., Zhang, R., Tuli, M., Bahmani, S., Lindell, D.B.: MVP4D : Multi-view portrait video diffusion for animatable 4D avatars. In: SIGGRAPH Asia Conference Papers. ACM (2025)

  71. [71]

    In: CVPR

    Taubner, F., Zhang, R., Tuli, M., Lindell, D.B.: CAP4D : Creating animatable 4D portrait avatars with morphable multi-view diffusion models. In: CVPR. pp. 5318--5330 (June 2025)

  72. [72]

    In: SIGGRAPH 2022 Conference Papers

    Wang, D., Chandran, P., Zoss, G., Bradley, D., Gotardo, P.: Morf: Morphable radiance fields for multiview neural head modeling. In: SIGGRAPH 2022 Conference Papers. SIGGRAPH '22, Association for Computing Machinery, New York, NY, USA (2022)

  73. [73]

    In: CVPR

    Wang, D., Deng, Y., Yin, Z., Shum, H.Y., Wang, B.: Progressive disentangled representation learning for fine-grained controllable talking head synthesis. In: CVPR. pp. 17979--17989 (2023)

  74. [74]

    TVCG (2025)

    Wang, J., Xie, J.C., Li, X., Xu, F., Pun, C.M., Gao, H.: Gaussianhead: High-fidelity head avatars with learnable gaussian derivation. TVCG (2025)

  75. [75]

    In: ECCV

    Wang, K., Wu, Q., Song, L., Yang, Z., Wu, W., Qian, C., He, R., Qiao, Y., Loy, C.C.: Mead: A large-scale audio-visual dataset for emotional talking-face generation. In: ECCV. Springer (2020)

  76. [76]

    In: CVPR (June 2022)

    Wang, L., Chen, Z., Yu, T., Ma, C., Li, L., Liu, Y.: Faceverse: a fine-grained and detail-controllable 3d face morphable model from a hybrid dataset. In: CVPR (June 2022)

  77. [77]

    In: CVPR

    Wang, T.C., Mallya, A., Liu, M.Y.: One-shot free-view neural talking-head synthesis for video conferencing. In: CVPR. pp. 10039--10049 (2021)

  78. [78]

    arXiv:2403.17694 (2024)

    Wei, H., Yang, Z., Wang, Z.: Aniportrait: Audio-driven synthesis of photorealistic portrait animations. arXiv:2403.17694 (2024)

  79. [79]

    In: ICLR (2025)

    Wu, T.W., Yang, J., Guo, Z., Wan, J., Zhong, F., Oztireli, C.: Gaussian head & shoulders: High fidelity neural upper body avatars with anchor gaussian guided texture warping. In: ICLR (2025)

  80. [80]

    In: CVPR (2026)

    Wu, Z., Zhou, B., Hu, L., Liu, H., Sun, Y., Wang, X., Cao, X., Shen, Y., Zhu, H.: Uika: Fast universal head avatar from pose-free images. In: CVPR (2026)

Showing first 80 references.