pith. sign in

arxiv: 2606.01493 · v1 · pith:64B22QZLnew · submitted 2026-05-31 · 💻 cs.CV

Splatshot: 3D Face Avatar Generation from a Single Unconstrained Photo

Pith reviewed 2026-06-28 16:58 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D face avatarGaussian Splattingdiffusion modelssingle image reconstructiontraining-freemulti-view consistencyphotorealistic avatars3D feedback loop
0
0 comments X

The pith

SplatShot generates 3D face avatars from one photo by feeding diffusion predictions back into a 3D Gaussian Splatting model at each denoising step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reconstructing photorealistic 3D faces from a single unconstrained photo is hard because feed-forward 3D models fail on unusual inputs and diffusion models produce good images but inconsistent views. The paper shows that explicit 3D representations and 2D diffusion priors are complementary. It introduces a training-free method that couples them in the denoising process by jointly denoising multiple views and using a 3D feedback loop. At each step, clean images are predicted, the 3D model is refit, and the difference is used to adjust the noise. This produces avatars with better identity preservation and consistency than either approach alone. Experiments on wild images confirm the improvements.

Core claim

Given a base 3DGS face model and a single reference image, SplatShot jointly denoises all target views using a per-step 3D feedback loop. At each timestep, it predicts clean images from noisy latents, refits the 3DGS to these predictions, and back-propagates the photometric discrepancy between 3DGS re-renderings and 2D predictions into the noise estimate to steer the sampling toward 3D-coherent outputs.

What carries the argument

The per-step 3D feedback loop that refits the 3D Gaussian Splatting model to multi-view diffusion predictions and back-propagates photometric discrepancy to correct the noise estimates.

If this is right

  • Produces 3D avatars with superior identity preservation compared to base methods.
  • Achieves high photorealism from the diffusion prior while maintaining geometric consistency from 3DGS.
  • Works without any training or fine-tuning on the input image.
  • Handles diverse in-the-wild images effectively.
  • Ensures multi-view consistency in the generated avatars.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar feedback mechanisms could improve consistency in other 3D generation tasks beyond faces.
  • The approach might reduce the need for large multi-view training datasets in avatar creation.
  • Extending the loop to handle dynamic expressions or head poses could be a natural next step.

Load-bearing premise

That the photometric discrepancy between 3DGS re-renderings and 2D diffusion predictions can be back-propagated to produce geometrically consistent outputs without any training on the input image.

What would settle it

Observing multi-view inconsistencies or identity mismatches in the output avatars when tested on a set of unconstrained photos where the base 3DGS model performs poorly.

Figures

Figures reproduced from arXiv: 2606.01493 by Ashok Veeraraghavan, Guha Balakrishnan, Hao Liang, Joanna Li, Soumendu Majee, Zhixuan Ge.

Figure 1
Figure 1. Figure 1: From a casual group photo to individual 3D face avatars. Given a single unconstrained photograph (left), SplatShot produces an explicit 3D Gaussian Splatting (3DGS) [1] face avatar for each individual that can be rendered from novel viewpoints (right). Abstract Reconstructing a photorealistic 3D face avatar from a single unconstrained photo￾graph is challenging: feed-forward 3D Gaussian Splatting (3DGS) mo… view at source ↗
Figure 2
Figure 2. Figure 2: Cross-attention maps. Text tokens (top) activate over semantically distinct regions; image tokens (bottom) overlap broadly without spatial disentanglement. A parallel line of work edits exist￾ing 3DGS scenes by manipulating cross-attention maps to enforce multi￾view consistency [16]. These methods inverse-render 2D attention maps onto 3D Gaussians and re-project them, en￾suring different views attend to th… view at source ↗
Figure 3
Figure 3. Figure 3: Method overview. (Top) Given an input photograph Iin, SplatShot selects a matching base 3DGS model M and iteratively refines it through a 3DGS-guided img2img diffusion process, producing a final 3DGS avatar. (Bottom) At each denoising step, the UNet predicts per-view noise ϵ t,v θ conditioned on Iin, from which predicted clean images xˆ v 0 are decoded. These images are used to update M, which is then re-r… view at source ↗
Figure 4
Figure 4. Figure 4: Noise composition affects the structural progression of generated faces during the diffusion denoising process (see § 4.2). Top: using predicted noise alone yields corrupted early-step predictions with over-saturated colors and unstable backgrounds. Bottom (Ours): our noise mixture mechanism (Eq. 7) blends predicted and ground-truth noise, producing stable, identity-consistent predictions from the first st… view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative comparison on 3D face avatar generation from a single image. Given a single unconstrained input (left), we compare our method with Intergsedit [16], LAM [19], Human￾3Diffusion [7], FaceLift [20], Arc2Avatar [6], DreamGaussian [15], GAGAvatar [3], and FastA￾vatar [4]. Previous methods yield synthetic-looking results, struggle with out-of-distribution inputs, produce low-quality novel views, or s… view at source ↗
Figure 6
Figure 6. Figure 6: Effect of base 3DGS model selection. Each pair of rows shows the same input generated with two different base models. Identity is successfully transferred in both cases, but hairstyle and head shape are largely inherited from the base. This is expected: trained identity encoders typically crop out hair, and hair geometry is inherently less stable for 3DGS reconstruction. We therefore prioritize hairstyle m… view at source ↗
Figure 7
Figure 7. Figure 7: Visual examples of ablation study. Left to right: effect of guidance scale λ, hybrid weight w, and denoising strength s. Without guidance (λ = 0), 3DGS rendering lacks 3D consistency in novel views. Excessive guidance (λ = 100,000) over-constrains the output. Without hybrid prediction (w = 1), early inconsistent predictions force Gaussians to compensate with unstable positions and colors, degrading renderi… view at source ↗
Figure 9
Figure 9. Figure 9: Novel view renderings from refitted 3DGS. Top: GAGAvatar’s outputs refitted to a 3DGS and rendered from novel viewpoints. The severe artifacts (floaters, fragmented geometry, inconsistent structure) reveal that GAGAvatar’s multi-view outputs are not 3D-consistent, as its visual quality relies on a 2D neural renderer rather than an explicit 3D representation. Bottom: our method’s outputs refitted under the … view at source ↗
Figure 16
Figure 16. Figure 16: Attention injection for text-guided editing. Left: source image. Middle: editing with attention injection preserves structure and localizes the change. Right: editing without injection causes unintended structural changes. Extension to 3DGS: inverse-forward rendering. To enforce this consistency across multiple views of a 3DGS scene, prior work [16, 32, 33] inverse-renders the 2D attention maps onto the 3… view at source ↗
Figure 17
Figure 17. Figure 17: PeRFlow results. With only 4 sampling steps, the diffusion outputs (left) appear reasonable individually but the 3DGS renderings (right) reveal poor multi-view consistency due to insufficient guidance iterations. Since our geometry guidance operates at each denoising step, fewer steps would proportionally reduce the number of 3DGS refitting iterations and overall runtime. However, with only 4-8 guidance o… view at source ↗
Figure 8
Figure 8. Figure 8: Example identities in the NeRSemble dataset. These sequences serve as the geometric base [PITH_FULL_IMAGE:figures/full_fig_p022_8.png] view at source ↗
Figure 10
Figure 10. Figure 10: Same base model, different input images. Top 4 and bottom 4 models have different input images (left), and same base 3DGS model (second column). 23 [PITH_FULL_IMAGE:figures/full_fig_p023_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Same input image, different base models. Top 4 and bottom 4 models share a same input image (left), with different 3DGS base models (second column). 24 [PITH_FULL_IMAGE:figures/full_fig_p024_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: CelebA-3D generation results (identities 00000–00029). For each identity, the leftmost [PITH_FULL_IMAGE:figures/full_fig_p025_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: CelebA-3D generation results (identities 00030–00059). [PITH_FULL_IMAGE:figures/full_fig_p026_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: CelebA-3D generation results (identities 00060–00089). [PITH_FULL_IMAGE:figures/full_fig_p027_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Qualitative comparison between diffusion outputs and 3DGS renderings for identities [PITH_FULL_IMAGE:figures/full_fig_p028_15.png] view at source ↗
read the original abstract

Reconstructing a photorealistic 3D face avatar from a single unconstrained photograph is challenging: feed-forward 3D Gaussian Splatting (3DGS) models degrade on out-of-distribution inputs, while pretrained diffusion models produce high-fidelity images but lack multi-view consistency. We observe that these paradigms are fundamentally complementary: explicit 3D representations guarantee geometric consistency, whereas 2D diffusion priors ensure photorealism. Building on this, we propose SplatShot, a training-free framework that couples these representations directly within the denoising process. Given a base 3DGS face model and a single reference image, we jointly denoise all target views using a per-step 3D feedback loop. At each timestep, we predict clean images from the noisy latents, refit the 3DGS to these multi-view predictions, and back-propagate the photometric discrepancy between the 3DGS re-renderings and 2D predictions into the noise estimate. This steers the sampling trajectory toward strictly 3D-coherent, identity-faithful outputs. Experiments on diverse in-the-wild images demonstrate that SplatShot produces 3D avatars with superior identity preservation, photorealism, and multi-view consistency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents SplatShot, a training-free framework for generating 3D face avatars from a single unconstrained photograph. It starts from a base 3D Gaussian Splatting (3DGS) face model and a reference image, then jointly denoises multiple target views by means of a per-step feedback loop: at each timestep the diffusion model predicts clean images from noisy latents, the 3DGS is refit to those multi-view predictions, and the photometric discrepancy between the 3DGS re-renders and the 2D predictions is back-propagated to adjust the noise estimate. The authors claim this steers sampling toward outputs that are simultaneously 3D-coherent and identity-faithful. Experiments on diverse in-the-wild images are said to demonstrate superior identity preservation, photorealism, and multi-view consistency.

Significance. If the feedback mechanism reliably enforces geometric consistency without any per-image training or fine-tuning, the work would constitute a meaningful advance in single-image 3D reconstruction by directly coupling an explicit 3D representation with a pretrained 2D diffusion prior. The training-free character and the explicit use of photometric discrepancy as a corrective signal during sampling are clear strengths that distinguish the approach from purely feed-forward or purely generative baselines.

major comments (2)
  1. [Abstract; Method (per-step 3D feedback loop)] The central claim rests on the assumption that refitting the 3DGS model to diffusion x0 predictions supplies a usable geometric prior at every timestep, including high-noise regimes. No analysis, ablation, or stability argument is supplied to counter the possibility that early-timestep x0 estimates (dominated by the diffusion prior) produce degenerate or unstable 3DGS fits whose photometric discrepancy signal is noisy or contradictory (see the per-step loop description in the abstract and the method section).
  2. [Experiments] The assertion that SplatShot produces “superior” identity preservation, photorealism, and multi-view consistency is supported solely by qualitative statements. No quantitative metrics, ablation studies, error analysis, or baseline comparisons appear in the reported experiments, leaving the effectiveness of the back-propagation step unquantified (see Experiments section).
minor comments (2)
  1. [Method] The precise mathematical form of the back-propagation step (how photometric discrepancy modifies the noise estimate) should be stated explicitly, ideally with a short equation or pseudocode block.
  2. [Figures] Figure captions and axis labels in the qualitative results should indicate the exact viewpoints and reference image used so that multi-view consistency claims can be visually verified.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below and will revise the manuscript accordingly to strengthen the presentation.

read point-by-point responses
  1. Referee: [Abstract; Method (per-step 3D feedback loop)] The central claim rests on the assumption that refitting the 3DGS model to diffusion x0 predictions supplies a usable geometric prior at every timestep, including high-noise regimes. No analysis, ablation, or stability argument is supplied to counter the possibility that early-timestep x0 estimates (dominated by the diffusion prior) produce degenerate or unstable 3DGS fits whose photometric discrepancy signal is noisy or contradictory (see the per-step loop description in the abstract and the method section).

    Authors: We acknowledge that the manuscript does not include explicit analysis, ablations, or stability arguments for 3DGS refitting behavior specifically in high-noise regimes. The approach is designed around iterative refinement, where the photometric feedback progressively improves consistency as denoising proceeds from noisy to clean states. To directly address this concern, we will add a dedicated discussion of the feedback loop's behavior across timesteps along with an ablation examining the effect of initiating the 3D refitting at different noise levels. revision: yes

  2. Referee: [Experiments] The assertion that SplatShot produces “superior” identity preservation, photorealism, and multi-view consistency is supported solely by qualitative statements. No quantitative metrics, ablation studies, error analysis, or baseline comparisons appear in the reported experiments, leaving the effectiveness of the back-propagation step unquantified (see Experiments section).

    Authors: We agree that the current experiments section relies on qualitative demonstrations and lacks quantitative support for the superiority claims. The manuscript focuses on visual results across diverse in-the-wild inputs to highlight the method's practical advantages. In revision we will add quantitative metrics (e.g., identity similarity via ArcFace, multi-view consistency via cross-view PSNR/LPIPS), error analysis, and comparisons against relevant baselines, together with ablations isolating the contribution of the photometric feedback. revision: yes

Circularity Check

0 steps flagged

No significant circularity; method is a heuristic loop without reduction to inputs

full rationale

The paper describes an algorithmic procedure (per-step prediction, refit of 3DGS, photometric back-propagation into noise) rather than a derivation claiming first-principles predictions or uniqueness. No equations or steps reduce by construction to fitted parameters, self-citations, or renamed inputs. The framework is presented as training-free and externally testable via experiments on in-the-wild images, with no load-bearing self-citation chains or self-definitional elements visible in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the untested premise that 3DGS and diffusion priors are complementary in the specific way described and that the feedback loop converges to consistent geometry without additional regularization or training.

axioms (1)
  • domain assumption Explicit 3D representations guarantee geometric consistency while 2D diffusion priors ensure photorealism.
    Stated in the abstract as the foundational observation that motivates the method.

pith-pipeline@v0.9.1-grok · 5768 in / 1303 out tokens · 14670 ms · 2026-06-28T16:58:24.637248+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

97 extracted references · 14 canonical work pages · 8 internal anchors

  1. [1]

    3d gaussian splatting for real-time radiance field rendering.ACM Trans

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, George Drettakis, et al. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1, 2023

  2. [2]

    Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 65 (1):99–106, 2021

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 65 (1):99–106, 2021

  3. [3]

    Generalizable and animatable gaussian head avatar.Advances in Neural Information Processing Systems, 37:57642–57670, 2024

    Xuangeng Chu and Tatsuya Harada. Generalizable and animatable gaussian head avatar.Advances in Neural Information Processing Systems, 37:57642–57670, 2024

  4. [4]

    Fastavatar: Instant 3d gaussian splatting for faces from single unconstrained poses

    Hao Liang, Zhixuan Ge, Soumendu Majee, Ashish Tiwari, GM Godaliyadda, Ashok Veeraraghavan, and Guha Balakrishnan. Fastavatar: Instant 3d gaussian splatting for faces from single unconstrained poses. arXiv preprint arXiv:2508.18389, 2025

  5. [5]

    Efficient geometry-aware 3d generative adversarial networks

    Eric R Chan, Connor Z Lin, Matthew A Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J Guibas, Jonathan Tremblay, Sameh Khamis, et al. Efficient geometry-aware 3d generative adversarial networks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16123–16133, 2022

  6. [6]

    Arc2avatar: Generating expressive 3d avatars from a single image via id guidance

    Dimitrios Gerogiannis, Foivos Paraperas Papantoniou, Rolandos Alexandros Potamias, Alexandros Lattas, and Stefanos Zafeiriou. Arc2avatar: Generating expressive 3d avatars from a single image via id guidance. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 10770–10782, 2025

  7. [7]

    Human-3diffusion: realistic avatar creation via explicit 3d consistent diffusion models.Advances in Neural Information Processing Systems, 37:99601–99645, 2024

    Yuxuan Xue, Xianghui Xie, Riccardo Marin, and Gerard Pons-Moll. Human-3diffusion: realistic avatar creation via explicit 3d consistent diffusion models.Advances in Neural Information Processing Systems, 37:99601–99645, 2024

  8. [8]

    DreamFusion: Text-to-3D using 2D Diffusion

    Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022

  9. [9]

    Face recognition based on fitting a 3d morphable model.IEEE Transactions on pattern analysis and machine intelligence, 25(9):1063–1074, 2003

    V olker Blanz and Thomas Vetter. Face recognition based on fitting a 3d morphable model.IEEE Transactions on pattern analysis and machine intelligence, 25(9):1063–1074, 2003

  10. [10]

    Learning to generate conditional tri-plane for 3d-aware expression controllable portrait animation

    Taekyung Ki, Dongchan Min, and Gyeongsu Chae. Learning to generate conditional tri-plane for 3d-aware expression controllable portrait animation. InEuropean Conference on Computer Vision, pages 476–493. Springer, 2024

  11. [11]

    Auto-Encoding Variational Bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational bayes.arXiv preprint arXiv:1312.6114, 2013

  12. [12]

    A style-based generator architecture for generative adversarial networks

    Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019

  13. [13]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 10

  14. [14]

    Next3d: Generative neural texture rasterization for 3d-aware head avatars

    Jingxiang Sun, Xuan Wang, Lizhen Wang, Xiaoyu Li, Yong Zhang, Hongwen Zhang, and Yebin Liu. Next3d: Generative neural texture rasterization for 3d-aware head avatars. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20991–21002, 2023

  15. [15]

    DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation

    Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation.arXiv preprint arXiv:2309.16653, 2023

  16. [16]

    Intergsedit: Interactive 3d gaussian splatting editing with 3d geometry-consistent attention prior

    Minghao Wen, Shengjie Wu, Kangkan Wang, and Dong Liang. Intergsedit: Interactive 3d gaussian splatting editing with 3d geometry-consistent attention prior. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 26136–26145, 2025

  17. [17]

    IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

    Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models.arXiv preprint arXiv:2308.06721, 2023

  18. [18]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502, 2020

  19. [19]

    Lam: large avatar model for one-shot animatable gaussian head

    Yisheng He, Xiaodong Gu, Xiaodan Ye, Chao Xu, Zhengyi Zhao, Yuan Dong, Weihao Yuan, Zilong Dong, and Liefeng Bo. Lam: large avatar model for one-shot animatable gaussian head. InProceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–13, 2025

  20. [20]

    Facelift: Learning generalizable single image 3d face reconstruction from synthetic heads

    Weijie Lyu, Yi Zhou, Ming-Hsuan Yang, and Zhixin Shu. Facelift: Learning generalizable single image 3d face reconstruction from synthetic heads. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 12691–12701, 2025

  21. [21]

    Deep learning face attributes in the wild

    Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pages 3730–3738, 2015

  22. [22]

    Nersemble: Multi-view radiance field reconstruction of human heads.ACM Transactions on Graphics (TOG), 42(4): 1–14, 2023

    Tobias Kirschstein, Shenhan Qian, Simon Giebenhain, Tim Walter, and Matthias Nießner. Nersemble: Multi-view radiance field reconstruction of human heads.ACM Transactions on Graphics (TOG), 42(4): 1–14, 2023

  23. [23]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. InProceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023

  24. [24]

    gsplat: An open-source library for gaussian splatting

    Vickie Ye, Ruilong Li, Justin Kerr, Matias Turkulainen, Brent Yi, Zhuoyang Pan, Otto Seiskari, Jianbo Ye, Jeffrey Hu, Matthew Tancik, and Angjoo Kanazawa. gsplat: An open-source library for gaussian splatting. Journal of Machine Learning Research, 26(34):1–17, 2025

  25. [25]

    Arcface: Additive angular margin loss for deep face recognition

    Jiankang Deng, Jia Guo, Xue Niannan, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. InCVPR, 2019

  26. [26]

    Black, and Timo Bolkart

    Yao Feng, Haiwen Feng, Michael J. Black, and Timo Bolkart. Learning an animatable detailed 3d face model from in-the-wild images.ACM Trans. Graph., 40(4), July 2021. ISSN 0730-0301. doi: 10.1145/3450626.3459936. URLhttps://doi.org/10.1145/3450626.3459936

  27. [27]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

  28. [28]

    Exploring clip for assessing the look and feel of images

    Jianyi Wang, Kelvin CK Chan, and Chen Change Loy. Exploring clip for assessing the look and feel of images. InProceedings of the AAAI conference on artificial intelligence, volume 37, pages 2555–2563, 2023

  29. [29]

    Perflow: Piecewise rectified flow as universal plug-and-play accelerator.Advances in Neural Information Processing Systems, 37:78630–78652, 2024

    Hanshu Yan, Xingchao Liu, Jiachun Pan, Jun Hao Liew, Qiang Liu, and Jiashi Feng. Perflow: Piecewise rectified flow as universal plug-and-play accelerator.Advances in Neural Information Processing Systems, 37:78630–78652, 2024

  30. [30]

    Retinaface: Single-shot multi-level face localisation in the wild

    Jiankang Deng, Jia Guo, Evangelos Ververas, Irene Kotsia, and Stefanos Zafeiriou. Retinaface: Single-shot multi-level face localisation in the wild. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5203–5212, 2020

  31. [31]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021. 11

  32. [32]

    Gaussianeditor: Swift and controllable 3d editing with gaussian splatting

    Yiwen Chen, Zilong Chen, Chi Zhang, Feng Wang, Xiaofeng Yang, Yikai Wang, Zhongang Cai, Lei Yang, Huaping Liu, and Guosheng Lin. Gaussianeditor: Swift and controllable 3d editing with gaussian splatting. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21476–21485, 2024

  33. [33]

    View-consistent 3d editing with gaussian splatting

    Yuxuan Wang, Xuanyu Yi, Zike Wu, Na Zhao, Long Chen, and Hanwang Zhang. View-consistent 3d editing with gaussian splatting. InEuropean conference on computer vision, pages 404–420. Springer, 2024

  34. [34]

    Prompt-to-Prompt Image Editing with Cross Attention Control

    Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to- prompt image editing with cross attention control.arXiv preprint arXiv:2208.01626, 2022

  35. [35]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022

  36. [36]

    A morphable face albedo model

    William AP Smith, Alassane Seck, Hannah Dee, Bernard Tiddeman, Joshua B Tenenbaum, and Bernhard Egger. A morphable face albedo model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5011–5020, 2020

  37. [37]

    A morphable model for the synthesis of 3d faces

    V olker Blanz and Thomas Vetter. A morphable model for the synthesis of 3d faces. InSeminal Graphics Papers: Pushing the Boundaries, Volume 2, pages 157–164. 2023

  38. [38]

    Learning formation of physically-based face attributes

    Ruilong Li, Karl Bladin, Yajie Zhao, Chinmay Chinara, Owen Ingraham, Pengda Xiang, Xinglei Ren, Pratusha Prasad, Bipin Kishore, Jun Xing, et al. Learning formation of physically-based face attributes. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3410–3419, 2020

  39. [39]

    Non-rigid neural radiance fields: Reconstruction and novel view synthesis of a dynamic scene from monocular video

    Edgar Tretschk, Ayush Tewari, Vladislav Golyanik, Michael Zollhöfer, Christoph Lassner, and Christian Theobalt. Non-rigid neural radiance fields: Reconstruction and novel view synthesis of a dynamic scene from monocular video. InProceedings of the IEEE/CVF international conference on computer vision, pages 12959–12970, 2021

  40. [40]

    Nerfies: Deformable neural radiance fields

    Keunhong Park, Utkarsh Sinha, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Steven M Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields. InProceedings of the IEEE/CVF international conference on computer vision, pages 5865–5874, 2021

  41. [41]

    Mofanerf: Morphable facial neural radiance field

    Yiyu Zhuang, Hao Zhu, Xusen Sun, and Xun Cao. Mofanerf: Morphable facial neural radiance field. In European conference on computer vision, pages 268–285. Springer, 2022

  42. [42]

    Headnerf: A real-time nerf-based parametric head model

    Yang Hong, Bo Peng, Haiyao Xiao, Ligang Liu, and Juyong Zhang. Headnerf: A real-time nerf-based parametric head model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20374–20384, 2022

  43. [43]

    Dynamic neural radiance fields for monocular 4d facial avatar reconstruction

    Guy Gafni, Justus Thies, Michael Zollhofer, and Matthias Nießner. Dynamic neural radiance fields for monocular 4d facial avatar reconstruction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8649–8658, 2021

  44. [44]

    Marcel C. Buehler, Gengyan Li, Erroll Wood, Leonhard Helminger, Xu Chen, Tanmay Shah, Daoye Wang, Stephan Garbin, Sergio Orts-Escolano, Otmar Hilliges, Dmitry Lagun, Jérémy Riviere, Paulo Gotardo, Thabo Beeler, Abhimitra Meka, and Kripasindhu Sarkar. Cafca: High-quality novel view synthesis of expressive faces from casual few-shot captures. InACM SIGGRAPH...

  45. [45]

    Real-time radiance fields for single-image portrait view synthesis.ACM Transactions on Graphics (TOG), 42(4):1–15, 2023

    Alex Trevithick, Matthew Chan, Michael Stengel, Eric Chan, Chao Liu, Zhiding Yu, Sameh Khamis, Manmohan Chandraker, Ravi Ramamoorthi, and Koki Nagano. Real-time radiance fields for single-image portrait view synthesis.ACM Transactions on Graphics (TOG), 42(4):1–15, 2023

  46. [46]

    3d gaussian parametric head model

    Yuelang Xu, Lizhen Wang, Zerong Zheng, Zhaoqi Su, and Yebin Liu. 3d gaussian parametric head model. InEuropean Conference on Computer Vision, pages 129–147. Springer, 2024

  47. [47]

    Headgas: Real-time animatable head avatars via 3d gaussian splatting

    Helisa Dhamo, Yinyu Nie, Arthur Moreau, Jifei Song, Richard Shaw, Yiren Zhou, and Eduardo Pérez- Pellitero. Headgas: Real-time animatable head avatars via 3d gaussian splatting. InEuropean Conference on Computer Vision, pages 459–476. Springer, 2024

  48. [48]

    Graphavatar: Compact head avatars with gnn-generated 3d gaussians

    Xiaobao Wei, Peng Chen, Ming Lu, Hui Chen, and Feng Tian. Graphavatar: Compact head avatars with gnn-generated 3d gaussians. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 8295–8303, 2025. 12

  49. [49]

    3d gaussian blendshapes for head avatar animation

    Shengjie Ma, Yanlin Weng, Tianjia Shao, and Kun Zhou. 3d gaussian blendshapes for head avatar animation. InACM SIGGRAPH 2024 Conference Papers, pages 1–10, 2024

  50. [50]

    Gaussianavatars: Photorealistic head avatars with rigged 3d gaussians

    Shenhan Qian, Tobias Kirschstein, Liam Schoneveld, Davide Davoli, Simon Giebenhain, and Matthias Nießner. Gaussianavatars: Photorealistic head avatars with rigged 3d gaussians. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20299–20309, 2024

  51. [51]

    Splattingavatar: Realistic real-time human avatars with mesh-embedded gaussian splatting

    Zhijing Shao, Zhaolong Wang, Zhuang Li, Duotun Wang, Xiangru Lin, Yu Zhang, Mingming Fan, and Zeyu Wang. Splattingavatar: Realistic real-time human avatars with mesh-embedded gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1606–1616, 2024

  52. [52]

    Flashavatar: High-fidelity head avatar with efficient gaussian embedding

    Jun Xiang, Xuan Gao, Yudong Guo, and Juyong Zhang. Flashavatar: High-fidelity head avatar with efficient gaussian embedding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1802–1812, 2024

  53. [53]

    Hugs: Human gaussian splats

    Muhammed Kocabas, Jen-Hao Rick Chang, James Gabriel, Oncel Tuzel, and Anurag Ranjan. Hugs: Human gaussian splats. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 505–515, 2024

  54. [54]

    Gaussian head avatar: Ultra high-fidelity head avatar via dynamic gaussians

    Yuelang Xu, Benwang Chen, Zhe Li, Hongwen Zhang, Lizhen Wang, Zerong Zheng, and Yebin Liu. Gaussian head avatar: Ultra high-fidelity head avatar via dynamic gaussians. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1931–1941, 2024

  55. [55]

    3dgs-avatar: Animatable avatars via deformable 3d gaussian splatting

    Zhiyin Qian, Shaofei Wang, Marko Mihajlovic, Andreas Geiger, and Siyu Tang. 3dgs-avatar: Animatable avatars via deformable 3d gaussian splatting. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5020–5030, 2024

  56. [56]

    Gasp: Gaussian avatars with synthetic priors

    Jack Saunders, Charlie Hewitt, Yanan Jian, Marek Kowalski, Tadas Baltrusaitis, Yiye Chen, Darren Cosker, Virginia Estellers, Nicholas Gydé, Vinay P Namboodiri, et al. Gasp: Gaussian avatars with synthetic priors. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 271–280, 2025

  57. [57]

    Gaussian deja-vu: Creating controllable 3d gaussian head-avatars with enhanced generalization and personalization abilities

    Peizhi Yan, Rabab Ward, Qiang Tang, and Shan Du. Gaussian deja-vu: Creating controllable 3d gaussian head-avatars with enhanced generalization and personalization abilities. InProceedings of the Winter Conference on Applications of Computer Vision (WACV), pages 276–286, February 2025

  58. [58]

    Gaussianavatar: Towards realistic human avatar modeling from a single video via animatable 3d gaussians

    Liangxiao Hu, Hongwen Zhang, Yuxiang Zhang, Boyao Zhou, Boning Liu, Shengping Zhang, and Liqiang Nie. Gaussianavatar: Towards realistic human avatar modeling from a single video via animatable 3d gaussians. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 634–644, 2024

  59. [59]

    Mega: Hybrid mesh-gaussian head avatar for high-fidelity rendering and head editing

    Cong Wang, Di Kang, Heyi Sun, Shenhan Qian, Zixuan Wang, Linchao Bao, and Song-Hai Zhang. Mega: Hybrid mesh-gaussian head avatar for high-fidelity rendering and head editing. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 26274–26284, 2025

  60. [60]

    Npga: Neural parametric gaussian avatars

    Simon Giebenhain, Tobias Kirschstein, Martin Rünz, Lourdes Agapito, and Matthias Nießner. Npga: Neural parametric gaussian avatars. InSIGGRAPH Asia 2024 Conference Papers (SA Conference Papers ’24), December 3-6, Tokyo, Japan, 2024. ISBN 979-8-4007-1131-2/24/12. doi: 10.1145/3680528.3687689

  61. [61]

    Generalizable one-shot 3d neural head avatar.Advances in Neural Information Processing Systems, 36:47239–47250, 2023

    Xueting Li, Shalini De Mello, Sifei Liu, Koki Nagano, Umar Iqbal, and Jan Kautz. Generalizable one-shot 3d neural head avatar.Advances in Neural Information Processing Systems, 36:47239–47250, 2023

  62. [62]

    Otavatar: One-shot talking face avatar with controllable tri-plane rendering

    Zhiyuan Ma, Xiangyu Zhu, Guo-Jun Qi, Zhen Lei, and Lei Zhang. Otavatar: One-shot talking face avatar with controllable tri-plane rendering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16901–16910, 2023

  63. [63]

    V oodoo 3d: V olumetric portrait disentanglement for one-shot 3d head reenactment

    Phong Tran, Egor Zakharov, Long-Nhat Ho, Anh Tuan Tran, Liwen Hu, and Hao Li. V oodoo 3d: V olumetric portrait disentanglement for one-shot 3d head reenactment. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10336–10348, 2024

  64. [64]

    Facescape: a large-scale high quality 3d face dataset and detailed riggable 3d face prediction

    Haotian Yang, Hao Zhu, Yanru Wang, Mingkai Huang, Qiu Shen, Ruigang Yang, and Xun Cao. Facescape: a large-scale high quality 3d face dataset and detailed riggable 3d face prediction. InProceedings of the ieee/cvf conference on computer vision and pattern recognition, pages 601–610, 2020

  65. [65]

    Towards metrical reconstruction of human faces

    Wojciech Zielonka, Timo Bolkart, and Justus Thies. Towards metrical reconstruction of human faces. In European conference on computer vision, pages 250–269. Springer, 2022. 13

  66. [66]

    Talkinggaussian: Structure- persistent 3d talking head synthesis via gaussian splatting

    Jiahe Li, Jiawei Zhang, Xiao Bai, Jin Zheng, Xin Ning, Jun Zhou, and Lin Gu. Talkinggaussian: Structure- persistent 3d talking head synthesis via gaussian splatting. InEuropean Conference on Computer Vision, pages 127–145. Springer, 2024

  67. [67]

    Pointavatar: Deformable point-based head avatars from videos

    Yufeng Zheng, Wang Yifan, Gordon Wetzstein, Michael J Black, and Otmar Hilliges. Pointavatar: Deformable point-based head avatars from videos. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21057–21067, 2023

  68. [68]

    One-shot high-fidelity talking-head synthesis with de- formable neural radiance field

    Weichuang Li, Longhao Zhang, Dong Wang, Bin Zhao, Zhigang Wang, Mulin Chen, Bang Zhang, Zhongjian Wang, Liefeng Bo, and Xuelong Li. One-shot high-fidelity talking-head synthesis with de- formable neural radiance field. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17969–17978, 2023

  69. [69]

    Cvthead: One-shot con- trollable head avatar with vertex-feature transformer

    Haoyu Ma, Tong Zhang, Shanlin Sun, Xiangyi Yan, Kun Han, and Xiaohui Xie. Cvthead: One-shot con- trollable head avatar with vertex-feature transformer. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6131–6141, 2024

  70. [70]

    Learning dense correspondence for nerf-based face reenactment

    Songlin Yang, Wei Wang, Yushi Lan, Xiangyu Fan, Bo Peng, Lei Yang, and Jing Dong. Learning dense correspondence for nerf-based face reenactment. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 6522–6530, 2024

  71. [71]

    Gpavatar: Generalizable and precise head avatar from image (s).arXiv preprint arXiv:2401.10215, 2024

    Xuangeng Chu, Yu Li, Ailing Zeng, Tianyu Yang, Lijian Lin, Yunfei Liu, and Tatsuya Harada. Gpavatar: Generalizable and precise head avatar from image (s).arXiv preprint arXiv:2401.10215, 2024

  72. [72]

    Real3d-portrait: One-shot realistic 3d talking portrait synthesis.arXiv preprint arXiv:2401.08503, 2024

    Zhenhui Ye, Tianyun Zhong, Yi Ren, Jiaqi Yang, Weichuang Li, Jiawei Huang, Ziyue Jiang, Jinzheng He, Rongjie Huang, Jinglin Liu, et al. Real3d-portrait: One-shot realistic 3d talking portrait synthesis.arXiv preprint arXiv:2401.08503, 2024

  73. [73]

    Coherent 3d portrait video reconstruction via triplane fusion

    Shengze Wang, Xueting Li, Chao Liu, Matthew Chan, Michael Stengel, Henry Fuchs, Shalini De Mello, and Koki Nagano. Coherent 3d portrait video reconstruction via triplane fusion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10712–10722, June 2025

  74. [74]

    Generating diverse high-fidelity images with vq-vae-2

    Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with vq-vae-2. Advances in neural information processing systems, 32, 2019

  75. [75]

    Analyzing and improving the image quality of stylegan

    Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8110–8119, 2020

  76. [76]

    Alias-free generative adversarial networks.Advances in neural information processing systems, 34: 852–863, 2021

    Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks.Advances in neural information processing systems, 34: 852–863, 2021

  77. [77]

    Diffusion models beat gans on image synthesis.Advances in neural information processing systems, 34:8780–8794, 2021

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis.Advances in neural information processing systems, 34:8780–8794, 2021

  78. [78]

    Gram: Generative radiance manifolds for 3d-aware image generation

    Yu Deng, Jiaolong Yang, Jianfeng Xiang, and Xin Tong. Gram: Generative radiance manifolds for 3d-aware image generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10673–10683, 2022

  79. [79]

    Panohead: Geometry- aware 3d full-head synthesis in 360deg

    Sizhe An, Hongyi Xu, Yichun Shi, Guoxian Song, Umit Y Ogras, and Linjie Luo. Panohead: Geometry- aware 3d full-head synthesis in 360deg. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20950–20959, 2023

  80. [80]

    Spherehead: stable 3d full-head synthesis with spherical tri-plane representation

    Heyuan Li, Ce Chen, Tianhao Shi, Yuda Qiu, Sizhe An, Guanying Chen, and Xiaoguang Han. Spherehead: stable 3d full-head synthesis with spherical tri-plane representation. InEuropean Conference on Computer Vision, pages 324–341. Springer, 2024

Showing first 80 references.