pith. machine review for the scientific record.

arxiv: 2604.22865 · v1 · submitted 2026-04-23 · 💻 cs.CV

Recognition: unknown

MeshLAM: Feed-Forward One-Shot Animatable Textured Mesh Avatar Reconstruction


Pith reviewed 2026-05-09 22:04 UTC · model grok-4.3

classification 💻 cs.CV
keywords animatable mesh reconstruction · one-shot 3D avatar · feed-forward head modeling · textured mesh generation · single-image 3D · transformer for meshes · GRU progressive deformation

The pith

A single forward pass from one image yields a complete, animatable textured 3D head mesh.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that high-fidelity animatable head avatars can be generated directly as meshes from a single photograph without test-time optimization or multiple input views. It does this through a dual architecture that jointly handles vertex positions and texture maps via a shared transformer, followed by iterative refinement that progressively deforms the geometry and updates the appearance. A sympathetic reader would care because this removes the need for lengthy per-instance fitting or camera arrays, potentially allowing instant creation of moving 3D faces from ordinary snapshots. The approach centers on preserving mesh topology during deformation while anchoring textures to the source image through reprojection.

Core claim

MeshLAM produces complete, inherently animatable mesh representations from a single image in a single forward pass. It does so with a dual shape-and-texture-map architecture that processes mesh vertices and texture maps jointly with image features extracted by a shared transformer backbone, an iterative GRU-based decoder that progressively deforms geometry and refines texture, and a reprojection-based guidance mechanism that anchors the texture to the input image.

What carries the argument

Dual shape and texture map architecture with shared transformer backbone and iterative GRU-based progressive deformation, which jointly carves geometry and refines appearance while preventing collapse.
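
To make that pipeline concrete, here is a minimal sketch of how a dual-branch forward pass over a shared transformer backbone could be organized. Module names, dimensions, and the template interface are assumptions for illustration; the paper describes the architecture only at the level of the abstract.

    # Illustrative sketch: shared transformer features feed a shape branch
    # (per-vertex offsets from a template such as FLAME) and a texture branch
    # (a UV texture map). Not the paper's implementation.
    import torch
    import torch.nn as nn

    class DualBranchAvatar(nn.Module):
        def __init__(self, n_verts, feat_dim=256, uv_res=64):
            super().__init__()
            layer = nn.TransformerEncoderLayer(feat_dim, nhead=8, batch_first=True)
            self.backbone = nn.TransformerEncoder(layer, num_layers=4)  # shared
            self.patch_embed = nn.Conv2d(3, feat_dim, kernel_size=16, stride=16)
            self.vert_queries = nn.Parameter(torch.randn(n_verts, feat_dim))
            self.shape_head = nn.Linear(feat_dim, 3)   # per-vertex 3D offset
            self.tex_head = nn.Linear(feat_dim, 3 * uv_res * uv_res)
            self.uv_res = uv_res

        def forward(self, image, template_verts):
            # image: (B, 3, H, W); template_verts: (V, 3)
            tokens = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, P, C)
            B, V = tokens.shape[0], self.vert_queries.shape[0]
            queries = self.vert_queries.unsqueeze(0).expand(B, -1, -1)
            feats = self.backbone(torch.cat([queries, tokens], dim=1))
            vert_feats, img_feats = feats[:, :V], feats[:, V:]
            verts = template_verts + self.shape_head(vert_feats)  # deform template
            texture = self.tex_head(img_feats.mean(1)).view(B, 3, self.uv_res, self.uv_res)
            return verts, texture

In the actual method this update is not emitted once but iterated by GRU decoders; the point here is only the shared-feature, two-branch layout.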

If this is right

  • The output meshes are immediately usable for animation without additional rigging steps.
  • Reconstruction quality, animation fidelity, and runtime speed all exceed those of optimization-based and multi-view baselines.
  • Topological integrity is maintained through the progressive deformation schedule even in a feed-forward setting.
  • Appearance remains anchored to the input image via the reprojection guidance term.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same single-pass pipeline could support real-time avatar updates if applied frame-by-frame to video.
  • If the deformation mechanism generalizes, the approach might extend to full-body or hand meshes with minimal changes.
  • Deployment on mobile devices becomes feasible once the forward pass fits within typical GPU memory limits.
  • Failure on extreme head poses would point to the need for stronger pose-invariant feature extraction.

Load-bearing premise

The shared transformer plus iterative GRU refinement can preserve mesh topology and produce coherent textures from single-image features alone.

What would settle it

Run the model on a held-out single image, apply standard facial animation rigs to the output mesh, and check whether self-intersections appear or the reprojected texture deviates from the input photo.
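
The geometric half of that test is easy to caricature in code. The proxy below counts triangles whose orientation flips or whose area collapses after a deformation, which is cheaper than, and weaker than, a true self-intersection test; everything here is a hypothetical stand-in, not the paper's evaluation protocol.

    # Crude proxy for mesh degradation under animation: flipped or collapsed
    # triangles after applying a deformation. Pure NumPy.
    import numpy as np

    def face_normals(verts, faces):
        v0, v1, v2 = verts[faces[:, 0]], verts[faces[:, 1]], verts[faces[:, 2]]
        return np.cross(v1 - v0, v2 - v0)  # unnormalized; magnitude = 2 * area

    def degradation_report(rest_verts, posed_verts, faces, area_eps=1e-10):
        n_rest = face_normals(rest_verts, faces)
        n_posed = face_normals(posed_verts, faces)
        flipped = (n_rest * n_posed).sum(axis=1) < 0   # orientation reversed
        area = 0.5 * np.linalg.norm(n_posed, axis=1)
        return {"flipped": int(flipped.sum()),
                "degenerate": int((area < area_eps).sum())}

A passing run keeps both counts at zero across an expression sequence; the texture half of the test additionally needs a renderer to compare the reprojected appearance against the input photo.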

Figures

Figures reproduced from arXiv: 2604.22865 by Steven Hoi, Yisheng He.

Figure 1. Overall Framework. Our method reconstructs an animatable 3D textured head mesh from a single image through dual shape and texture branches. After extracting features from the input image with a shared transformer, the shape branch predicts vertex deformations while the texture branch synthesizes UV texture maps, both conditioned on a FLAME template. Both branches are refined iteratively via GRU decoders wit…
Figure 2. Qualitative comparison of 3D head avatar creation and animation on challenging texture cases. Our mesh-based framework…
Figure 3. The cross-domain generalization capability of our frame…
Figure 4. Effect of iterative mesh decoding. Without our GRU…
Figure 5. Effect of texture space reprojection. Without our repro…
Figure 6. Effect of part-aware deformation. Without semantic…
Figure 7. Reconstructed geometry and texture map visualization.
Figure 8. In the wild challenging lighting and occlusion.
Figure 9. Qualitative comparison with more baselines.
Original abstract

We introduce MeshLAM, a feed-forward framework for one-shot animatable mesh head reconstruction that generates high-fidelity, animatable 3D head avatars from a single image. Unlike previous work that relies on time-consuming test-time optimization or extensive multi-view data, our method produces complete mesh representations with inherent animatability from a single image in a single forward pass. Our approach employs a dual shape and texture map architecture that simultaneously processes mesh vertices and texture map with extracted image features from a shared transformer backbone, allowing for coherent shape carving and appearance modeling. To prevent mesh collapse and ensure topological integrity during feed-forward deformation, we propose an iterative GRU-based decoding mechanism with progressive geometry deformation and texture refinement, coupled with a novel reprojection-based texture guidance mechanism that anchors appearance learning to the input image. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches in reconstruction quality, animation capability, and computational efficiency. Project page at https://meshlam.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces MeshLAM, a feed-forward framework for one-shot reconstruction of animatable textured mesh avatars from a single image. It employs a dual shape and texture map architecture with a shared transformer backbone and an iterative GRU-based progressive deformation mechanism with reprojection-based texture guidance to produce complete meshes that are inherently animatable without test-time optimization or multi-view data. The authors claim superior performance over state-of-the-art methods in reconstruction quality, animation capability, and efficiency.

Significance. Should the quantitative results and ablations support the claims, this would be a notable contribution to the field of 3D avatar reconstruction by enabling efficient, single-image feed-forward generation of animatable meshes, which could impact applications in AR, VR, and digital humans. The avoidance of optimization at test time is particularly promising for real-time use cases.

major comments (1)
  1. The abstract states that the GRU-based mechanism 'prevents mesh collapse' and ensures 'topological integrity' during feed-forward deformation, but provides no details on explicit constraints such as fixed connectivity, Laplacian regularization, or collision terms. This is load-bearing for the central 'inherent animatability' claim, as monocular depth ambiguity could otherwise allow folding or inconsistent vertex/UV coherence under large expression changes. (A sketch of one such candidate constraint appears below.)
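
For concreteness, here is what the kind of explicit constraint the referee names could look like: a uniform Laplacian smoothness term over a fixed-connectivity mesh. This is an editorial illustration, not the paper's loss; indeed, the simulated rebuttal below states that no such term is used.

    # Illustrative sketch: a uniform Laplacian smoothness penalty over a mesh
    # with fixed connectivity. Dense adjacency is used for clarity only; a
    # real implementation would use sparse ops. Not the paper's loss.
    import torch

    def laplacian_smoothness(verts, faces):
        """verts: (N, 3) float positions; faces: (F, 3) long indices."""
        N = verts.shape[0]
        # Symmetric adjacency assembled from the three edges of each triangle.
        e = torch.cat([faces[:, [0, 1]], faces[:, [1, 2]], faces[:, [2, 0]]], dim=0)
        idx = torch.cat([e, e.flip(1)], dim=0)
        adj = torch.zeros(N, N, device=verts.device)
        adj[idx[:, 0], idx[:, 1]] = 1.0
        degree = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        # Penalize each vertex's offset from the centroid of its neighbors.
        neighbor_mean = (adj @ verts) / degree
        return ((verts - neighbor_mean) ** 2).sum(dim=1).mean()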
minor comments (2)
  1. The abstract references 'extensive experiments' demonstrating outperformance but includes no quantitative metrics, ablation results, or error analysis; a brief summary of key numbers (e.g., reconstruction error, animation fidelity) should be added.
  2. Notation for the dual shape-texture architecture and the reprojection-based guidance is introduced without equations or pseudocode in the provided text; adding these would improve clarity of the iterative decoding process. (A sketch of one plausible reprojection term appears after this list.)
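
Since the minor comment asks for pseudocode, here is one plausible shape a reprojection-based guidance term could take: sample the input photo at each vertex's projected location and penalize disagreement with the predicted color. The interface is hypothetical; the available text does not specify the actual formulation.

    # Illustrative sketch of reprojection-based texture guidance. All names
    # and shapes are assumptions; this is not the paper's implementation.
    import torch
    import torch.nn.functional as F

    def reprojection_guidance(pred_rgb, ndc_xy, visibility, image):
        """pred_rgb: (V, 3) predicted vertex colors; ndc_xy: (V, 2) projected
        vertex positions in [-1, 1]; visibility: (V,) 1.0 if the vertex is
        seen in the input view; image: (3, H, W) input photo."""
        grid = ndc_xy.view(1, -1, 1, 2)                  # (1, V, 1, 2)
        ref = F.grid_sample(image.unsqueeze(0), grid,
                            align_corners=False)         # (1, 3, V, 1)
        ref = ref.squeeze(0).squeeze(-1).t()             # (V, 3)
        err = (pred_rgb - ref).abs().sum(dim=1)          # per-vertex L1
        return (err * visibility).sum() / visibility.sum().clamp(min=1.0)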

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment below and will revise the manuscript to provide additional clarifications on the mechanisms for maintaining mesh stability.

Point-by-point responses
  1. Referee: The abstract states that the GRU-based mechanism 'prevents mesh collapse' and ensures 'topological integrity' during feed-forward deformation, but provides no details on explicit constraints such as fixed connectivity, Laplacian regularization, or collision terms. This is load-bearing for the central 'inherent animatability' claim, as monocular depth ambiguity could otherwise allow folding or inconsistent vertex/UV coherence under large expression changes.

    Authors: We thank the referee for highlighting this important point regarding clarity. In MeshLAM, we employ a fixed-topology template mesh (detailed in Section 3.1), with predefined vertex connectivity that remains constant throughout the deformation process; this design choice inherently preserves topology without requiring additional constraints. The iterative GRU-based decoding (Section 3.2) performs progressive, multi-step geometry refinement rather than a single large update, which reduces the likelihood of folding or collapse arising from monocular depth ambiguity. The reprojection-based texture guidance mechanism further promotes coherence by anchoring texture predictions to the input image via differentiable rendering, helping maintain vertex-UV consistency during animation. While our loss does not include explicit Laplacian regularization or collision terms, the combination of fixed connectivity, iterative refinement, and guidance has yielded stable results in our experiments and ablations. We agree the abstract and method section would benefit from more explicit discussion of these aspects and will revise accordingly, including expanded text and a supporting figure in the next version. (A sketch of such a fixed-topology iterative scheme appears below.) revision: yes
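
To ground the rebuttal's description, here is a minimal sketch of fixed-topology, multi-step GRU refinement: connectivity never changes, and each step emits a small bounded offset. Dimensions, the tanh bound, and the module layout are all assumptions for illustration; the paper's decoder is not specified at this level of detail.

    # Illustrative sketch: iterative GRU-based vertex refinement over a
    # fixed-connectivity template. The face list is never touched, so topology
    # is preserved by construction; the bounded per-step offsets mirror the
    # rebuttal's progressive, multi-step refinement rather than one large update.
    import torch
    import torch.nn as nn

    class IterativeVertexRefiner(nn.Module):
        def __init__(self, feat_dim=256, hidden_dim=128, steps=4, max_step=0.01):
            super().__init__()
            self.cell = nn.GRUCell(feat_dim + 3, hidden_dim)
            self.to_offset = nn.Linear(hidden_dim, 3)
            self.steps, self.max_step = steps, max_step

        def forward(self, template_verts, vert_feats):
            # template_verts: (V, 3); vert_feats: (V, feat_dim) from the backbone.
            V = template_verts.shape[0]
            h = vert_feats.new_zeros(V, self.cell.hidden_size)
            verts = template_verts
            for _ in range(self.steps):
                h = self.cell(torch.cat([vert_feats, verts], dim=1), h)
                # The tanh bound keeps each update small, discouraging folding.
                verts = verts + self.max_step * torch.tanh(self.to_offset(h))
            return verts  # same vertex count, and connectivity, throughout

Note that fixed connectivity rules out topology changes by construction but does not by itself prevent self-intersections; the bounded iterative schedule is what the rebuttal leans on for that.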

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper presents an architectural ML model (shared transformer + dual shape/texture maps + iterative GRU decoding with reprojection guidance) for single-image mesh avatar reconstruction. No equations, first-principles derivations, or 'predictions' are described that reduce to fitted inputs or self-citations by construction. The central claim of inherent animatability is an empirical outcome of the feed-forward network design rather than a mathematical identity or renamed fit. The provided abstract and context contain no load-bearing self-citations or ansatzes that collapse the result to its own inputs, making the derivation self-contained as a standard neural architecture proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract-only view provides no explicit free parameters, axioms, or invented entities beyond the high-level neural architecture; standard neural network training assumptions are implicit but unstated.

axioms (1)
  • domain assumption: Neural networks trained on appropriate 3D head datasets can generalize to produce topologically valid meshes from single images.
    Implicit requirement for the feed-forward reconstruction to succeed.
invented entities (1)
  • MeshLAM dual shape-texture architecture · no independent evidence
    purpose: Simultaneous mesh vertex and texture map processing from image features
    Core novel component introduced in the abstract; no independent evidence provided.

pith-pipeline@v0.9.0 · 5464 in / 1206 out tokens · 36138 ms · 2026-05-09T22:04:19.253730+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

80 extracted references · 12 canonical work pages · 3 internal anchors
