pith. machine review for the scientific record.

arxiv: 2604.22865 · v1 · submitted 2026-04-23 · 💻 cs.CV

Recognition: unknown

MeshLAM: Feed-Forward One-Shot Animatable Textured Mesh Avatar Reconstruction


Pith reviewed 2026-05-09 22:04 UTC · model grok-4.3

classification 💻 cs.CV
keywords animatable mesh reconstruction · one-shot 3D avatar · feed-forward head modeling · textured mesh generation · single-image 3D · transformer for meshes · GRU progressive deformation

The pith

A single forward pass from one image yields a complete, animatable textured 3D head mesh.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that high-fidelity animatable head avatars can be generated directly as meshes from a single photograph without test-time optimization or multiple input views. It does this through a dual architecture that jointly handles vertex positions and texture maps via a shared transformer, followed by iterative refinement that progressively deforms the geometry and updates the appearance. A sympathetic reader would care because this removes the need for lengthy per-instance fitting or camera arrays, potentially allowing instant creation of moving 3D faces from ordinary snapshots. The approach centers on preserving mesh topology during deformation while anchoring textures to the source image through reprojection.

Core claim

MeshLAM produces complete, inherently animatable mesh representations from a single image in a single forward pass. It does so with a dual shape-and-texture-map architecture that processes mesh vertices and texture maps jointly with image features extracted by a shared transformer backbone, an iterative GRU-based decoder that progressively deforms geometry and refines texture, and a reprojection-based guidance mechanism that anchors the texture to the input image.

What carries the argument

Dual shape and texture map architecture with shared transformer backbone and iterative GRU-based progressive deformation, which jointly carves geometry and refines appearance while preventing collapse.
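
To make that pipeline concrete, here is a minimal sketch of how a dual-branch forward pass over a shared transformer backbone could be organized. Module names, dimensions, and the template interface are assumptions for illustration; the paper describes the architecture only at the level of the abstract.

    # Illustrative sketch: shared transformer features feed a shape branch
    # (per-vertex offsets from a template such as FLAME) and a texture branch
    # (a UV texture map). Not the paper's implementation.
    import torch
    import torch.nn as nn

    class DualBranchAvatar(nn.Module):
        def __init__(self, n_verts, feat_dim=256, uv_res=64):
            super().__init__()
            layer = nn.TransformerEncoderLayer(feat_dim, nhead=8, batch_first=True)
            self.backbone = nn.TransformerEncoder(layer, num_layers=4)  # shared
            self.patch_embed = nn.Conv2d(3, feat_dim, kernel_size=16, stride=16)
            self.vert_queries = nn.Parameter(torch.randn(n_verts, feat_dim))
            self.shape_head = nn.Linear(feat_dim, 3)   # per-vertex 3D offset
            self.tex_head = nn.Linear(feat_dim, 3 * uv_res * uv_res)
            self.uv_res = uv_res

        def forward(self, image, template_verts):
            # image: (B, 3, H, W); template_verts: (V, 3)
            tokens = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, P, C)
            B, V = tokens.shape[0], self.vert_queries.shape[0]
            queries = self.vert_queries.unsqueeze(0).expand(B, -1, -1)
            feats = self.backbone(torch.cat([queries, tokens], dim=1))
            vert_feats, img_feats = feats[:, :V], feats[:, V:]
            verts = template_verts + self.shape_head(vert_feats)  # deform template
            texture = self.tex_head(img_feats.mean(1)).view(B, 3, self.uv_res, self.uv_res)
            return verts, texture

In the actual method this update is not emitted once but iterated by GRU decoders; the point here is only the shared-feature, two-branch layout.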

If this is right

  • The output meshes are immediately usable for animation without additional rigging steps.
  • Reconstruction quality, animation fidelity, and runtime speed all exceed those of optimization-based and multi-view baselines.
  • Topological integrity is maintained through the progressive deformation schedule even in a feed-forward setting.
  • Appearance remains anchored to the input image via the reprojection guidance term.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same single-pass pipeline could support real-time avatar updates if applied frame-by-frame to video.
  • If the deformation mechanism generalizes, the approach might extend to full-body or hand meshes with minimal changes.
  • Deployment on mobile devices becomes feasible once the forward pass fits within typical GPU memory limits.
  • Failure on extreme head poses would point to the need for stronger pose-invariant feature extraction.

Load-bearing premise

The shared transformer plus iterative GRU refinement can preserve mesh topology and produce coherent textures from single-image features alone.

What would settle it

Run the model on a held-out single image, apply standard facial animation rigs to the output mesh, and check whether self-intersections appear or the reprojected texture deviates from the input photo.
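
The geometric half of that test is easy to caricature in code. The proxy below counts triangles whose orientation flips or whose area collapses after a deformation, which is cheaper than, and weaker than, a true self-intersection test; everything here is a hypothetical stand-in, not the paper's evaluation protocol.

    # Crude proxy for mesh degradation under animation: flipped or collapsed
    # triangles after applying a deformation. Pure NumPy.
    import numpy as np

    def face_normals(verts, faces):
        v0, v1, v2 = verts[faces[:, 0]], verts[faces[:, 1]], verts[faces[:, 2]]
        return np.cross(v1 - v0, v2 - v0)  # unnormalized; magnitude = 2 * area

    def degradation_report(rest_verts, posed_verts, faces, area_eps=1e-10):
        n_rest = face_normals(rest_verts, faces)
        n_posed = face_normals(posed_verts, faces)
        flipped = (n_rest * n_posed).sum(axis=1) < 0   # orientation reversed
        area = 0.5 * np.linalg.norm(n_posed, axis=1)
        return {"flipped": int(flipped.sum()),
                "degenerate": int((area < area_eps).sum())}

A passing run keeps both counts at zero across an expression sequence; the texture half of the test additionally needs a renderer to compare the reprojected appearance against the input photo.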

Figures

Figures reproduced from arXiv: 2604.22865 by Steven Hoi, Yisheng He.

Figure 1. Overall Framework. Our method reconstructs an animatable 3D textured head mesh from a single image through dual shape and texture branches. After extracting features from the input image with a shared transformer, the shape branch predicts vertex deformations while the texture branch synthesizes UV texture maps, both conditioned on a FLAME template. Both branches are refined iteratively via GRU decoders wit…
Figure 2. Qualitative comparison of 3D head avatar creation and animation on challenging texture cases. Our mesh-based framework…
Figure 3. The cross-domain generalization capability of our frame…
Figure 4. Effect of iterative mesh decoding. Without our GRU…
Figure 5. Effect of texture space reprojection. Without our repro…
Figure 6. Effect of part-aware deformation. Without semantic…
Figure 7. Reconstructed geometry and texture map visualization.
Figure 8. In the wild challenging lighting and occlusion.
Figure 9. Qualitative comparison with more baselines.
Original abstract

We introduce MeshLAM, a feed-forward framework for one-shot animatable mesh head reconstruction that generates high-fidelity, animatable 3D head avatars from a single image. Unlike previous work that relies on time-consuming test-time optimization or extensive multi-view data, our method produces complete mesh representations with inherent animatability from a single image in a single forward pass. Our approach employs a dual shape and texture map architecture that simultaneously processes mesh vertices and texture map with extracted image features from a shared transformer backbone, allowing for coherent shape carving and appearance modeling. To prevent mesh collapse and ensure topological integrity during feed-forward deformation, we propose an iterative GRU-based decoding mechanism with progressive geometry deformation and texture refinement, coupled with a novel reprojection-based texture guidance mechanism that anchors appearance learning to the input image. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches in reconstruction quality, animation capability, and computational efficiency. Project page at https://meshlam.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces MeshLAM, a feed-forward framework for one-shot reconstruction of animatable textured mesh avatars from a single image. It employs a dual shape and texture map architecture with a shared transformer backbone and an iterative GRU-based progressive deformation mechanism with reprojection-based texture guidance to produce complete meshes that are inherently animatable without test-time optimization or multi-view data. The authors claim superior performance over state-of-the-art methods in reconstruction quality, animation capability, and efficiency.

Significance. Should the quantitative results and ablations support the claims, this would be a notable contribution to the field of 3D avatar reconstruction by enabling efficient, single-image feed-forward generation of animatable meshes, which could impact applications in AR, VR, and digital humans. The avoidance of optimization at test time is particularly promising for real-time use cases.

major comments (1)
  1. The abstract states that the GRU-based mechanism 'prevents mesh collapse' and ensures 'topological integrity' during feed-forward deformation, but provides no details on explicit constraints such as fixed connectivity, Laplacian regularization, or collision terms. This is load-bearing for the central 'inherent animatability' claim, as monocular depth ambiguity could otherwise allow folding or inconsistent vertex/UV coherence under large expression changes. (A sketch of one such candidate constraint appears below.)
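
For concreteness, here is what the kind of explicit constraint the referee names could look like: a uniform Laplacian smoothness term over a fixed-connectivity mesh. This is an editorial illustration, not the paper's loss; indeed, the simulated rebuttal below states that no such term is used.

    # Illustrative sketch: a uniform Laplacian smoothness penalty over a mesh
    # with fixed connectivity. Dense adjacency is used for clarity only; a
    # real implementation would use sparse ops. Not the paper's loss.
    import torch

    def laplacian_smoothness(verts, faces):
        """verts: (N, 3) float positions; faces: (F, 3) long indices."""
        N = verts.shape[0]
        # Symmetric adjacency assembled from the three edges of each triangle.
        e = torch.cat([faces[:, [0, 1]], faces[:, [1, 2]], faces[:, [2, 0]]], dim=0)
        idx = torch.cat([e, e.flip(1)], dim=0)
        adj = torch.zeros(N, N, device=verts.device)
        adj[idx[:, 0], idx[:, 1]] = 1.0
        degree = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        # Penalize each vertex's offset from the centroid of its neighbors.
        neighbor_mean = (adj @ verts) / degree
        return ((verts - neighbor_mean) ** 2).sum(dim=1).mean()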
minor comments (2)
  1. The abstract references 'extensive experiments' demonstrating outperformance but includes no quantitative metrics, ablation results, or error analysis; a brief summary of key numbers (e.g., reconstruction error, animation fidelity) should be added.
  2. Notation for the dual shape-texture architecture and the reprojection-based guidance is introduced without equations or pseudocode in the provided text; adding these would improve clarity of the iterative decoding process. (A sketch of one plausible reprojection term appears after this list.)
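
Since the minor comment asks for pseudocode, here is one plausible shape a reprojection-based guidance term could take: sample the input photo at each vertex's projected location and penalize disagreement with the predicted color. The interface is hypothetical; the available text does not specify the actual formulation.

    # Illustrative sketch of reprojection-based texture guidance. All names
    # and shapes are assumptions; this is not the paper's implementation.
    import torch
    import torch.nn.functional as F

    def reprojection_guidance(pred_rgb, ndc_xy, visibility, image):
        """pred_rgb: (V, 3) predicted vertex colors; ndc_xy: (V, 2) projected
        vertex positions in [-1, 1]; visibility: (V,) 1.0 if the vertex is
        seen in the input view; image: (3, H, W) input photo."""
        grid = ndc_xy.view(1, -1, 1, 2)                  # (1, V, 1, 2)
        ref = F.grid_sample(image.unsqueeze(0), grid,
                            align_corners=False)         # (1, 3, V, 1)
        ref = ref.squeeze(0).squeeze(-1).t()             # (V, 3)
        err = (pred_rgb - ref).abs().sum(dim=1)          # per-vertex L1
        return (err * visibility).sum() / visibility.sum().clamp(min=1.0)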

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment below and will revise the manuscript to provide additional clarifications on the mechanisms for maintaining mesh stability.

Point-by-point responses
  1. Referee: The abstract states that the GRU-based mechanism 'prevents mesh collapse' and ensures 'topological integrity' during feed-forward deformation, but provides no details on explicit constraints such as fixed connectivity, Laplacian regularization, or collision terms. This is load-bearing for the central 'inherent animatability' claim, as monocular depth ambiguity could otherwise allow folding or inconsistent vertex/UV coherence under large expression changes.

    Authors: We thank the referee for highlighting this important point regarding clarity. In MeshLAM, we employ a fixed-topology template mesh (detailed in Section 3.1), with predefined vertex connectivity that remains constant throughout the deformation process; this design choice inherently preserves topology without requiring additional constraints. The iterative GRU-based decoding (Section 3.2) performs progressive, multi-step geometry refinement rather than a single large update, which reduces the likelihood of folding or collapse arising from monocular depth ambiguity. The reprojection-based texture guidance mechanism further promotes coherence by anchoring texture predictions to the input image via differentiable rendering, helping maintain vertex-UV consistency during animation. While our loss does not include explicit Laplacian regularization or collision terms, the combination of fixed connectivity, iterative refinement, and guidance has yielded stable results in our experiments and ablations. We agree the abstract and method section would benefit from more explicit discussion of these aspects and will revise accordingly, including expanded text and a supporting figure in the next version. (A sketch of such a fixed-topology iterative scheme appears below.) revision: yes
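
To ground the rebuttal's description, here is a minimal sketch of fixed-topology, multi-step GRU refinement: connectivity never changes, and each step emits a small bounded offset. Dimensions, the tanh bound, and the module layout are all assumptions for illustration; the paper's decoder is not specified at this level of detail.

    # Illustrative sketch: iterative GRU-based vertex refinement over a
    # fixed-connectivity template. The face list is never touched, so topology
    # is preserved by construction; the bounded per-step offsets mirror the
    # rebuttal's progressive, multi-step refinement rather than one large update.
    import torch
    import torch.nn as nn

    class IterativeVertexRefiner(nn.Module):
        def __init__(self, feat_dim=256, hidden_dim=128, steps=4, max_step=0.01):
            super().__init__()
            self.cell = nn.GRUCell(feat_dim + 3, hidden_dim)
            self.to_offset = nn.Linear(hidden_dim, 3)
            self.steps, self.max_step = steps, max_step

        def forward(self, template_verts, vert_feats):
            # template_verts: (V, 3); vert_feats: (V, feat_dim) from the backbone.
            V = template_verts.shape[0]
            h = vert_feats.new_zeros(V, self.cell.hidden_size)
            verts = template_verts
            for _ in range(self.steps):
                h = self.cell(torch.cat([vert_feats, verts], dim=1), h)
                # The tanh bound keeps each update small, discouraging folding.
                verts = verts + self.max_step * torch.tanh(self.to_offset(h))
            return verts  # same vertex count, and connectivity, throughout

Note that fixed connectivity rules out topology changes by construction but does not by itself prevent self-intersections; the bounded iterative schedule is what the rebuttal leans on for that.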

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper presents an architectural ML model (shared transformer + dual shape/texture maps + iterative GRU decoding with reprojection guidance) for single-image mesh avatar reconstruction. No equations, first-principles derivations, or 'predictions' are described that reduce to fitted inputs or self-citations by construction. The central claim of inherent animatability is an empirical outcome of the feed-forward network design rather than a mathematical identity or renamed fit. The provided abstract and context contain no load-bearing self-citations or ansatzes that collapse the result to its own inputs, making the derivation self-contained as a standard neural architecture proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract-only view provides no explicit free parameters, axioms, or invented entities beyond the high-level neural architecture; standard neural network training assumptions are implicit but unstated.

axioms (1)
  • domain assumption: Neural networks trained on appropriate 3D head datasets can generalize to produce topologically valid meshes from single images.
    Implicit requirement for the feed-forward reconstruction to succeed.
invented entities (1)
  • MeshLAM dual shape-texture architecture · no independent evidence
    purpose: Simultaneous mesh vertex and texture map processing from image features
    Core novel component introduced in the abstract; no independent evidence provided.

pith-pipeline@v0.9.0 · 5464 in / 1206 out tokens · 36138 ms · 2026-05-09T22:04:19.253730+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

80 extracted references · 12 canonical work pages · 3 internal anchors
