pith. machine review for the scientific record.

arxiv: 2605.04035 · v2 · submitted 2026-05-05 · 💻 cs.CV · cs.LG

Recognition: no theorem link

Large-Scale High-Quality 3D Gaussian Head Reconstruction from Multi-View Captures

Alejandro Blumentals, Artem Sevastopolsky, Brian Amberg, Christian Zimmermann, Dmitry Kostiaev, Evangelos Ntavelis, Fabio Maninchedda, Jeronimo Bayer, Mathias Deschler, Matthias Vestner, Mehak Gupta, Mohamad Shahbazi, Peter Kaufmann, Reinhard Knothe, Sean Wu, Sebastian Martin, Shridhar Ravikumar, Simon Schaefer, Stefan Brugger, Thomas Etterlin, Tom Runia, Trevor Phillips, Vittorio Megaro

Pith reviewed 2026-05-11 02:08 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords 3D Gaussian splatting · head reconstruction · multi-view capture · feed-forward model · UV parameterization · 3D avatars · facial animation

The pith

HeadsUp reconstructs high-quality 3D Gaussian heads from multi-view images in a single feed-forward pass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HeadsUp as a scalable method that turns sets of images captured from many cameras into detailed 3D head models represented as Gaussian points. An encoder compresses the views into a compact latent code, and a decoder expands it into 3D Gaussians whose positions are defined by a UV map on a standard neutral head shape. This design keeps the number of Gaussians independent of image count or resolution, allowing training on high-resolution data from more than 10,000 different people. The resulting model produces state-of-the-art reconstructions and works on entirely new faces without needing extra per-person optimization steps. The same latent space also supports creating new head identities and driving the models with facial expressions.
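The decoupling claim can be made concrete with a small sketch. Nothing below is the authors' code; the function name and default sizes (a 32×32 latent and a 256×256 Gaussian UV map, matching the ablation settings quoted in Figure 4) are illustrative assumptions.

```python
# Hypothetical shape contract for a HeadsUp-style pipeline: the Gaussian
# budget is set by the UV resolution of the template parameterization,
# not by how many input views there are or how large they are.

def headsup_shapes(num_views: int, image_res: int,
                   latent_res: int = 32, uv_res: int = 256):
    """Return (latent_tokens, num_gaussians) for a given capture setup."""
    latent_tokens = latent_res * latent_res   # encoder output is fixed-size
    num_gaussians = uv_res * uv_res           # one Gaussian per UV texel
    return latent_tokens, num_gaussians

# Four low-res views and sixteen high-res views yield the same budget:
assert headsup_shapes(4, 512) == headsup_shapes(16, 4096) == (1024, 65536)
```

The point of the contract is that memory and compute on the Gaussian side stay flat as capture rigs grow, which is what makes training on high-resolution data from many cameras tractable.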

Core claim

HeadsUp is a scalable feed-forward method for reconstructing high-quality 3D Gaussian heads from large-scale multi-camera setups. It employs an efficient encoder-decoder architecture that compresses input views into a compact latent representation, which is decoded into a set of UV-parameterized 3D Gaussians anchored to a neutral head template. This enables training on an internal dataset with more than 10,000 subjects and achieves state-of-the-art reconstruction quality while generalizing to novel identities without test-time optimization.

What carries the argument

UV-parameterized 3D Gaussians anchored to a neutral head template, which decouples the number of Gaussians from the number and resolution of input images.
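As a rough illustration of what "anchored to a neutral head template" means (the names and data layout here are assumptions for illustration, not the paper's implementation), each UV texel can be read as a fixed base point on the template surface plus a small decoder-predicted offset:

```python
# Hypothetical UV anchoring: the template contributes a fixed 3D position
# per texel; the network only predicts a residual offset per texel.

def anchor_gaussians(position_map, offsets):
    """Both arguments map UV texels (u, v) to 3D points (x, y, z)."""
    gaussians = {}
    for uv, base in position_map.items():
        dx, dy, dz = offsets.get(uv, (0.0, 0.0, 0.0))
        gaussians[uv] = (base[0] + dx, base[1] + dy, base[2] + dz)
    return gaussians

template = {(0, 0): (0.0, 1.0, 0.0), (0, 1): (0.5, 1.0, 0.0)}
predicted = {(0, 1): (0.0, 0.25, 0.0)}   # e.g. hair displaced off the scalp
centers = anchor_gaussians(template, predicted)
# centers[(0, 1)] → (0.5, 1.25, 0.0)
```

Because the texel grid is fixed by the template, the output set has the same size and ordering for every subject, which is what lets one decoder serve all identities.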

Load-bearing premise

The internal dataset of more than 10,000 subjects is diverse enough for the model to generalize accurately to unseen identities without test-time optimization.

What would settle it

Measuring reconstruction error of the feed-forward model versus per-identity optimized baselines on a public multi-view head dataset with held-out identities; substantially higher error on novel subjects would falsify the generalization claim.
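The proposed test reduces to a metric gap on held-out subjects. A minimal sketch of that comparison, with PSNR as the stand-in metric (the function names and the use of PSNR here are illustrative assumptions; the paper's actual evaluation protocol is not specified in this summary):

```python
import math

def psnr(rendered, ground_truth, max_val=1.0):
    """PSNR in dB between two equal-length lists of intensities in [0, 1]."""
    mse = sum((r - g) ** 2 for r, g in zip(rendered, ground_truth)) / len(rendered)
    return float("inf") if mse == 0 else 10.0 * math.log10(max_val ** 2 / mse)

def generalization_gap(feedforward_render, optimized_render, ground_truth):
    """Positive gap: per-identity optimization still beats the feed-forward model."""
    return psnr(optimized_render, ground_truth) - psnr(feedforward_render, ground_truth)

# A substantially positive gap on held-out identities would falsify the claim.
```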

Figures

Figures reproduced from arXiv: 2605.04035.

Figure 1
Figure 1: We introduce HeadsUp, a novel feed-forward approach leveraging 3D Gaussians to predict high-quality avatars. By scaling to thousands of subjects and diverse expressions, our method achieves exceptional rendering quality on completely held-out subjects. Notice the accurate, high-resolution recovery of intricate fine details, such as eyelashes, complex earrings, teeth and tongue. The figure displays renders …
Figure 2
Figure 2: Overview of HeadsUp. Our method reconstructs high-fidelity 3D Gaussian heads from multi-view images. Given a set of input views, our model utilizes a transformer-based encoder and a 3D Gaussian decoder to predict UV-parameterized 3D Gaussians for both the foreground and background. The model is trained end-to-end using a combination of photometric and perceptual supervision. …
Figure 3
Figure 3: Visual Comparison on Ava-256. Our method produces sharper reconstructions with better identity preservation compared to prior work. Increasing the number of views permits reconstruction of details like earrings, hair and skin texture. Additionally, our background model successfully captures intricate head-boundary details that previous foreground-masking techniques typically discard; we only use the back…
Figure 4
Figure 4: Ablation study on a single-stage model trained for 500K steps with 10K subjects, 10 input views, 32×32 latent and 256×256 Gaussian UV resolution unless stated otherwise. (a) Training data scaling: log-linear improvement up to 2K subjects, diminishing returns beyond 4K. (b) Input view scaling: quality improves with more views, with diminishing returns after 8. (c) Model capacity: increasing latent resoluti…
Figure 5
Figure 5: Training data scaling. Models trained on fewer subjects fail to generalize to reconstruction of novel identities. At 250 subjects, facial features and hair color deviate significantly from ground truth. The reconstruction quality improves with more training data. On this validation set, the quality improves marginally after 4K subjects.
Figure 6
Figure 6: Impact of the number of input views. Reconstruction quality scales naturally with the number of input images. A single frontal view (N = 1) yields blurry results, identity drift, and fails to recover shirt text. However, adding more views progressively resolves these ambiguities, yielding clear improvements in fine details like the teeth and hair.
Figure 7
Figure 7: Downstream Applications. (a) Text-driven identity generation: …
read the original abstract

We propose HeadsUp, a scalable feed-forward method for reconstructing high-quality 3D Gaussian heads from large-scale multi-camera setups. Our method employs an efficient encoder-decoder architecture that compresses input views into a compact latent representation. This latent representation is then decoded into a set of UV-parameterized 3D Gaussians anchored to a neutral head template. This UV representation decouples the number of 3D Gaussians from the number and resolution of input images, enabling training with many high-resolution input views. We train and evaluate our model on an internal dataset with more than 10,000 subjects, which is an order of magnitude larger than existing multi-view human head datasets. HeadsUp achieves state-of-the-art reconstruction quality and generalizes to novel identities without test-time optimization. We extensively analyze the scaling behavior of our model across identities, views, and model capacity, revealing practical insights for quality-compute trade-offs. Finally, we highlight the strength of our latent space by showcasing two downstream applications: generating novel 3D identities and animating the 3D heads with expression blendshapes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes HeadsUp, a scalable feed-forward encoder-decoder method for high-quality 3D Gaussian head reconstruction from multi-view captures. Input views are compressed into a compact latent code that is decoded into UV-parameterized 3D Gaussians anchored to a fixed neutral head template; this decouples Gaussian count from input count and resolution. Trained on an internal dataset of >10,000 subjects (an order of magnitude larger than prior multi-view head datasets), the model is claimed to achieve SOTA reconstruction quality with zero-shot generalization to novel identities; the paper also includes a scaling analysis across identities, views, and capacity, and demonstrates downstream uses in novel identity generation and blendshape animation.

Significance. If substantiated, the work would be significant for enabling efficient, optimization-free 3D head assets at scale, with direct value for AR/VR, animation, and graphics pipelines. The large-scale training regime, explicit scaling study, and UV-based decoupling of representation size from capture resolution are practical strengths; the feed-forward design and latent-space applications further differentiate it from per-subject optimization baselines.

major comments (2)
  1. [Abstract] The central claim that HeadsUp 'achieves state-of-the-art reconstruction quality and generalizes to novel identities without test-time optimization' is unsupported by any quantitative metrics (PSNR, SSIM, LPIPS, etc.), baseline comparisons, ablation tables, or error analysis. This is load-bearing for the primary contribution and must be addressed with explicit results tables in the experiments section.
  2. [Dataset and Generalization] Dataset description and generalization analysis: the zero-shot generalization claim rests on the unverified assumption that the internal >10,000-subject dataset provides sufficient coverage of identity variation (cranial proportions, ethnicity, age, capture conditions) and that the fixed neutral-template UV parameterization preserves geometric/appearance detail without significant loss. No diversity statistics, cross-dataset evaluation, or ablation against deformable templates are referenced; if these assumptions fail, both SOTA quality and feed-forward generalization would not hold.
minor comments (2)
  1. [Method] The method description would benefit from explicit notation for the encoder-decoder architecture, latent dimensionality, and the precise UV-to-Gaussian mapping (e.g., how position/scale/rotation attributes compensate for template rigidity).
  2. [Scaling Analysis] Scaling analysis section should include concrete plots or tables showing quality vs. number of identities, views, and model parameters to make the 'practical insights for quality-compute trade-offs' actionable.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback, which helps strengthen the quantitative support for our claims and the analysis of generalization. We will revise the manuscript accordingly and address each major comment below.

read point-by-point responses
  1. Referee: [Abstract] The central claim that HeadsUp 'achieves state-of-the-art reconstruction quality and generalizes to novel identities without test-time optimization' is unsupported by any quantitative metrics (PSNR, SSIM, LPIPS, etc.), baseline comparisons, ablation tables, or error analysis. This is load-bearing for the primary contribution and must be addressed with explicit results tables in the experiments section.

    Authors: We agree that the abstract claim requires explicit quantitative backing. We will add a dedicated summary table in the experiments section presenting PSNR, SSIM, LPIPS, and related metrics with direct baseline comparisons, plus error analysis to substantiate the state-of-the-art quality and zero-shot generalization without test-time optimization. revision: yes

  2. Referee: [Dataset and Generalization] Dataset description and generalization analysis: the zero-shot generalization claim rests on the unverified assumption that the internal >10,000-subject dataset provides sufficient coverage of identity variation (cranial proportions, ethnicity, age, capture conditions) and that the fixed neutral-template UV parameterization preserves geometric/appearance detail without significant loss. No diversity statistics, cross-dataset evaluation, or ablation against deformable templates are referenced; if these assumptions fail, both SOTA quality and feed-forward generalization would not hold.

    Authors: We acknowledge the need for stronger verification of these assumptions. We will expand the dataset section with available diversity statistics on demographics and capture conditions. We will also add an ablation comparing the fixed neutral UV template to a deformable alternative to demonstrate detail preservation. Cross-dataset evaluation is limited by the lack of other large-scale public multi-view head datasets of similar scope; we will discuss this constraint explicitly in the revision. revision: partial

standing simulated objections not resolved
  • Full cross-dataset quantitative evaluation on comparable large-scale public multi-view head datasets, as no such datasets currently exist.

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper presents an empirical machine learning method: a trained encoder-decoder neural architecture that maps multi-view images to UV-parameterized 3D Gaussians on a neutral template. No mathematical derivations, equations, or uniqueness theorems are described that reduce any claimed prediction or result to the inputs by construction. Generalization to novel identities is asserted as an empirical outcome of training on the internal >10k-subject dataset and evaluating on held-out subjects, without any self-definitional fitting or renaming of known patterns. No load-bearing self-citations, smuggled ansatzes, or fitted-input predictions appear in the provided description. The central claims rest on architectural design and data scale rather than circular reductions, making the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract introduces no explicit free parameters, axioms, or invented entities beyond standard neural network components and the known UV parameterization technique from graphics.

pith-pipeline@v0.9.0 · 5587 in / 1324 out tokens · 53461 ms · 2026-05-11T02:08:50.153345+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

95 extracted references · 95 canonical work pages · 4 internal anchors

  1. [1]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    An, S., Xu, H., Shi, Y., Song, G., Ogras, U.Y., Luo, L.: PanoHead: Geometry-Aware 3D Full-Head Synthesis in 360°. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 20950–20959 (2023)

  2. [2]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Athar, S., Xu, Z., Sunkavalli, K., Shechtman, E., Shu, Z.: RigNeRF: Fully Con- trollable Neural 3D Portraits. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 20364–20373 (2022)

  3. [3]

    In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

    Bhattarai, A.R., Nießner, M., Sevastopolsky, A.: TriPlaneNet: An Encoder for EG3D Inversion. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 3533–3542 (2024)

  4. [4]

    In: SIGGRAPH Asia 2024 Conference Papers

    Bühler, M.C., Li, G., Wood, E., Helminger, L., Chen, X., Shah, T., Wang, D., Garbin, S., Orts-Escolano, S., Hilliges, O., Lagun, D., Riviere, J., Gotardo, P., Beeler, T., Meka, A., Sarkar, K.: Cafca: High-quality Novel View Synthesis of Expressive Faces from Casual Few-shot Captures. In: SIGGRAPH Asia 2024 Conference Papers. ACM (2024).https://doi.org/10....

  5. [5]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Chan, E.R., Lin, C.Z., Chan, M.A., Nagano, K., Pan, B., De Mello, S., Gallo, O., Guibas, L.J., Tremblay, J., Khamis, S., et al.: Efficient Geometry-aware 3D Generative Adversarial Networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 16123–16133 (2022)

  6. [6]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Charatan, D., Li, S., Tagliasacchi, A., Sitzmann, V.: pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19457–19467 (2024)

  7. [7]

    arXiv preprint arXiv:2406.06050 (2024)

    Chen, J., Li, C., Zhang, J., Zhu, L., Huang, B., Chen, H., Lee, G.H.: Generalizable Human Gaussians from Single-View Image. arXiv preprint arXiv:2406.06050 (2024)

  8. [8]

    In: European Conference on Computer Vision (ECCV)

    Chen, Y., Xu, H., Zheng, C., Zhuang, B., Pollefeys, M., Geiger, A., Cham, T.J., Cai, J.: MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images. In: European Conference on Computer Vision (ECCV). pp. 370–386. Springer (2024) 16 E. Ntavelis et al

  9. [9]

    NeurIPS (2024)

    Chu, X., Harada, T.: Generalizable and Animatable Gaussian Head Avatar. NeurIPS (2024)

  10. [10]

    arXiv preprint arXiv:2401.10215 (2024)

    Chu, X., Li, Y., Zeng, A., Yang, T., Lin, L., Liu, Y., Harada, T.: GPAvatar: Gener- alizable and Precise Head Avatar from Image(s). arXiv preprint arXiv:2401.10215 (2024)

  11. [11]

    Journal of Machine Learning Research25(70), 1–53 (2024)

    Chung, H.W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., et al.: Scaling instruction-finetuned language models. Journal of Machine Learning Research25(70), 1–53 (2024)

  12. [12]

    In: SIGGRAPH

    Debevec, P., Hawkins, T., Tchou, C., Duiker, H.P., Sarokin, W.: Acquiring the Reflectance Field of a Human Face. In: SIGGRAPH. New Orleans, LA (Jul 2000), http://ict.usc.edu/pubs/Acquiring%20the%20Re%EF%AC%82ectance%20Field% 20of%20a%20Human%20Face.pdf

  13. [13]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Deng, J., Guo, J., Xue, N., Zafeiriou, S.: ArcFace: Additive Angular Margin Loss for Deep Face Recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4690–4699 (2019)

  14. [14]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

    Deng, Y., Wang, D., Ren, X., Chen, X., Wang, B.: Portrait4D: Learning One-Shot 4D Head Avatar Synthesis using Synthetic Data. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

  15. [15]

    In: European Conference on Computer Vision (ECCV)

    Deng, Y., Wang, D., Wang, B.: Portrait4D-v2: Pseudo Multi-View Data Creates Better 4D Head Synthesizer. In: European Conference on Computer Vision (ECCV). pp. 303–321. Springer (2024)

  16. [16]

    ICLR (2021)

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR (2021)

  17. [17]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Gafni, G., Thies, J., Zollhöfer, M., Nießner, M.: Dynamic Neural Radiance Fields for Monocular 4D Facial Avatar Reconstruction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 8649–8658 (2021)

  18. [18]

    In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Giebenhain, S., Kirschstein, T., Georgopoulos, M., Rünz, M., Agapito, L., Nießner, M.: MonoNPHM: Dynamic Head Reconstruction from Monocular Videos. In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 21051–21061 (2024)

  19. [19]

    In: Advances in neural information processing systems

    Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative Adversarial Networks. In: Advances in neural information processing systems. pp. 2672–2680 (2014),http://papers.nips.cc/ paper/5423-generative-adversarial-nets.pdf

  20. [20]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

    Gu, Y., Xu, H., Xie, Y., Song, G., Shi, Y., Di, Y., Ye, P., et al.: DiffPortrait3D: Controllable Diffusion for Zero-Shot Portrait View Synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

  21. [21]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image Recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)

  22. [22]

    In: Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers

    He, Y., Gu, X., Ye, X., Xu, C., Zhao, Z., Dong, Y., Yuan, W., Dong, Z., Bo, L.: LAM: Large Avatar Model for One-shot Animatable Gaussian Head. In: Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers. pp. 1–13 (2025)

  23. [23]

    Ho, J., Salimans, T.: Classifier-free diffusion guidance. In: NeurIPS 2021 Work- shop on Deep Generative Models and Downstream Applications (2021),https: //openreview.net/forum?id=qw8AKxfYbI HeadsUp: Large-scale Gaussian Head Reconstruction 17

  24. [24]

    arXiv preprint arXiv:2311.04400 , year=

    Hong, Y., Zhang, K., Gu, J., Bi, S., Zhou, Y., Liu, D., Liu, F., Sunkavalli, K., Bui, T., Tan, H.: LRM: Large Reconstruction Model for Single Image to 3D. arXiv preprint arXiv:2311.04400 (2023)

  25. [25]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Hoogeboom, E., Mensink, T., Heek, J., Lamerigts, K., Gao, R., Salimans, T.: Simpler diffusion: 1.5 fid on imagenet512 with pixel-space diffusion. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 18062–18071 (2025)

  26. [26]

    International Organization for Standardization: ISO 7250-1:2017 Basic human body measurements for technological design — Part 1: Body measurement definitions and landmarks (2017), https://www.iso.org/standard/65246.html , accessed: 2026-02-16

  27. [27]

    arXiv preprint arXiv:2601.13837 (2026)

    Ji, X., Weiss, S., Kansy, M., Naruniec, J., Cao, X., Solenthaler, B., Bradley, D.: FastGHA: Generalized Few-Shot 3D Gaussian Head Avatars with Real-Time Animation. arXiv preprint arXiv:2601.13837 (2026)

  28. [28]

    International Journal of Computer Vision129(12), 3174–3194 (2021)

    Jin, H., Liao, S., Shao, L.: Pixel-in-Pixel Net: Towards Efficient Facial Landmark Detection in the Wild. International Journal of Computer Vision129(12), 3174–3194 (2021)

  29. [29]

    In: European Conference on Computer Vision (2016)

    Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual Losses for Real-Time Style Transfer and Super-Resolution. In: European Conference on Computer Vision (2016)

  30. [30]

    In: CVPR

    Kant, Y., Weber, E., Kim, J.K., Khirodkar, R., Zhaoen, S., Martinez, J., Gilitschen- ski, I., Saito, S., Bagautdinov, T.: Pippo: High-Resolution Multi-View Humans from a Single Image. In: CVPR. pp. 16418–16429 (2025)

  31. [31]

    ACM TOG42(4) (July 2023),https: //repo-sam.inria.fr/fungraph/3d-gaussian-splatting/

    Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM TOG42(4) (July 2023),https: //repo-sam.inria.fr/fungraph/3d-gaussian-splatting/

  32. [32]

    arXiv preprint arXiv:2206.08343 (2022)

    Khakhulin, T., Sklyarova, V., Lempitsky, V., Zakharov, E.: Realistic One-shot Mesh-based Head Avatars. arXiv preprint arXiv:2206.08343 (2022)

  33. [33]

    In: ECCV

    Khirodkar, R., Bagautdinov, T., Martinez, J., Zhaoen, S., James, A., Selednik, P., Anderson, S., Saito, S.: Sapiens: Foundation for Human Vision Models. In: ECCV. pp. 206–228. Springer (2024)

  34. [34]

    Adam: A Method for Stochastic Optimization

    Kingma, D.P.: Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980 (2014)

  35. [35]

    Segment Anything

    Kirillov, A., Mintun, E., Ravi, N., et al.: Segment Anything. arXiv preprint arXiv:2304.02643 (2023)

  36. [36]

    In: SIGGRAPH Asia 2024 Conference Papers

    Kirschstein, T., Giebenhain, S., Tang, J., Georgopoulos, M., Nießner, M.: GGHead: Fast and Generalizable 3D Gaussian Heads. In: SIGGRAPH Asia 2024 Conference Papers. pp. 1–11 (2024)

  37. [37]

    ACM TOG42(4), 1–14 (2023)

    Kirschstein, T., Qian, S., Giebenhain, S., Walter, T., Nießner, M.: NeRSemble: Multi-view Radiance Field Reconstruction of Human Heads. ACM TOG42(4), 1–14 (2023)

  38. [38]

    Avat3r: Large an- imatable gaussian reconstruction model for high-fidelity 3d head avatars.arXiv preprint arXiv:2502.20220, 2025

    Kirschstein, T., Romero, J., Sevastopolsky, A., Nießner, M., Saito, S.: Avat3r: Large Animatable Gaussian Reconstruction Model for High-fidelity 3D Head Avatars. arXiv preprint arXiv:2502.20220 (2025)

  39. [39]

    In: Proceedings of the European Conference on Computer Vision (ECCV) (2024)

    Kwon, Y., Fang, B., Lu, Y., Dong, H., Zhang, C., Vicente Carrasco, F., Mosella- Montoro, A., Xu, J., Takagi, S., Kim, D., Prakash, A., De la Torre, F.: Generalizable Human Gaussians for Sparse View Synthesis. In: Proceedings of the European Conference on Computer Vision (ECCV) (2024)

  40. [40]

    Ntavelis et al

    Lawrence, J., Goldman, D.B., Achar, S., Blascovich, G.M., Desloge, J.G., Fortes, T., Gomez, E.M., Häberling, S., Hoppe, H., Huibers, A., Knaus, C., Kuschak, B., Martin-Brualla, R., Nover, H., Russell, A.I., Seitz, S.M., Tong, K.: Project Starline: 18 E. Ntavelis et al. A High-Fidelity Telepresence System. ACM Transactions on Graphics (Proc. of SIGGRAPH As...

  41. [41]

    In: European Conference on Computer Vision (ECCV)

    Li, H., Chen, C., Shi, T., Qiu, Y., An, S., Chen, G., et al.: SphereHead: Stable 3D Full-head Synthesis with Spherical Tri-plane Representation. In: European Conference on Computer Vision (ECCV). Springer (2024)

  42. [42]

    In: SIGGRAPH Asia 2024 Conference Papers

    Li, J., Cao, C., Schwartz, G., Khirodkar, R., Saito, S., et al.: URAvatar: Universal Relightable Gaussian Codec Avatars. In: SIGGRAPH Asia 2024 Conference Papers. ACM (2024)

  43. [43]

    Panolam: Large avatar model for gaussian full- head synthesis from one-shot unposed image.arXiv preprint arXiv:2509.07552, 2025

    Li, P., He, Y., Hu, Y., Dong, Y., Yuan, W., Liu, Y., Zhu, S., Cheng, G., Dong, Z., Guo, Y.: PanoLAM: Large Avatar Model for Gaussian Full-Head Synthesis from One-shot Unposed Image. arXiv preprint arXiv:2509.07552 (2025)

  44. [44]

    In: Proceedings of the computer vision and pattern recognition conference

    Li, P., Zheng, W., Liu, Y., Yu, T., Li, Y., Qi, X., Chi, X., Xia, S., Cao, Y.P., Xue, W., et al.: PSHuman: Photorealistic Single-Image 3D Human Reconstruction Using Cross-Scale Multiview Diffusion and Explicit Remeshing. In: Proceedings of the computer vision and pattern recognition conference. pp. 16008–16018 (2025)

  45. [45]

    ACM TOG36(6), 194–1 (2017)

    Li, T., Bolkart, T., Black, M.J., Li, H., Romero, J.: Learning a Model of Facial Shape and Expression from 4D Scans. ACM TOG36(6), 194–1 (2017)

  46. [46]

    arXiv preprint arXiv:2508.18389 (2025)

    Liang, H., Ge, Z., Tiwari, A., Majee, S., Godaliyadda, G.M.D., Veeraraghavan, A., Balakrishnan, G.: FastAvatar: Instant 3D Gaussian Splatting for Faces from Single Unconstrained Poses. arXiv preprint arXiv:2508.18389 (2025)

  47. [47]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Lin, S., Ryabtsev, A., Sengupta, S., Curless, B.L., Seitz, S.M., Kemelmacher- Shlizerman, I.: Real-Time High-Resolution Background Matting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8762–8771 (2021)

  48. [48]

    ACM Transactions on Graphics (TOG)37(4), 1–13 (2018)

    Lombardi, S., Saragih, J., Simon, T., Sheikh, Y.: Deep Appearance Models for Face Rendering. ACM Transactions on Graphics (TOG)37(4), 1–13 (2018)

  49. [49]

    In: Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A

    Lu, C., Zhou, Y., Bao, F., Chen, J., LI, C., Zhu, J.: Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. In: Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A. (eds.) Advances in Neural Information Processing Systems. vol. 35, pp. 5775–5787. Curran Associates, Inc. (2022), https://proceedings.neur...

  50. [50]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Lu, Y., Dong, J., Kwon, Y., Zhao, Q., Dai, B., De la Torre, F.: GAS: Genera- tive Avatar Synthesis from a Single Image. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 12883–12893 (2025)

  51. [51]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Lyu, W., Zhou, Y., Yang, M.H., Shu, Z.: FaceLift: Learning Generalizable Single Image 3D Face Reconstruction from Synthetic Heads. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 12691–12701 (2025)

  52. [52]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Ma, S., Simon, T., Saragih, J., Wang, D., Li, Y., De La Torre, F., Sheikh, Y.: Pixel Codec Avatars. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 64–73 (2021)

  53. [53]

    NeurIPS (2024)

    Martinez, J., Kim, E., Romero, J., Bagautdinov, T., Saito, S., Yu, S.I., Anderson, S., Zollhöfer, M., Wang, T.L., Bai, S., Li, C., Wei, S.E., Joshi, R., Borsos, W., Simon, T., Saragih, J., Theodosis, P., Greene, A., Josyula, A., Maeta, S.M., Jewett, A.I., Venshtain, S., Heilman, C., Chen, Y.T., Fu, S., Elshaer, M.E.A., Du, T., Wu, L., Chen, S.C., Kang, K....

  54. [54]

    In: European Conference on Computer Vision (ECCV) (2020)

    Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In: European Conference on Computer Vision (ECCV) (2020)

  55. [55]

    DINOv2: Learning Robust Visual Features without Supervision

    Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: DINOv2: Learning Robust Visual Features without Supervision. arXiv preprint arXiv:2304.07193 (2023)

  56. [56]

    PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction & Editing

    Oroz, A., Nießner, M., Kirschstein, T.: PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction & Editing. arXiv preprint arXiv:2511.02777 (2025)

  57. [57]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

    Park, K., Sinha, U., Barron, J.T., Bouaziz, S., Goldman, D.B., Seitz, S.M., Martin- Brualla, R.: Nerfies: Deformable Neural Radiance Fields. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 5865–5874 (2021)

  58. [58]

    Peebles, W., Xie, S.: Scalable Diffusion Models with Transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 4195–4205 (2023)

  59. [59]

    Qian, S., Kirschstein, T., Schoneveld, L., Davoli, D., Giebenhain, S., Nießner, M.: GaussianAvatars: Photorealistic Head Avatars with Rigged 3D Gaussians. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

  60. [60]

    Qiu, L., Gu, X., Li, P., Zuo, Q., Shen, W., Zhang, J., Qiu, K., Yuan, W., Chen, G., Dong, Z., Bo, L.: LHM: Large Animatable Human Reconstruction Model from a Single Image in Seconds. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2025)

  61. [61]

    Qiu, L., Li, P., Zuo, Q., Gu, X., Dong, Y., Yuan, W., Zhu, S., Han, X., Chen, G., Dong, Z.: PF-LHM: 3D Animatable Avatar Reconstruction from Pose-free Articulated Human Images. arXiv preprint arXiv:2506.13766 (2025)

  62. [62]

    Roich, D., Mokady, R., Bermano, A.H., Cohen-Or, D.: Pivotal Tuning for Latent-based Editing of Real Images. ACM Transactions on Graphics (TOG) 42(1), 1–13 (2022)

  63. [63]

    Saito, S., Schwartz, G., Simon, T., Li, J., Nam, G.: Relightable Gaussian Codec Avatars. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 130–141 (2024)

  64. [64]

    Shao, Z., Wang, Z., Li, Z., Wang, D., Lin, X., Zhang, Y., Fan, M., Wang, Z.: SplattingAvatar: Realistic Real-Time Human Avatars with Mesh-Embedded Gaussian Splatting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 16922–16932 (2024)

  65. [65]

    Sitzmann, V., Rezchikov, S., Freeman, W.T., Tenenbaum, J.B., Durand, F.: Light Field Networks: Neural Scene Representations with Single-Evaluation Rendering. In: Proc. NeurIPS (2021)

  66. [66]

    Skorokhodov, I., Siarohin, A., Xu, Y., Ren, J., Lee, H.Y., Wonka, P., Tulyakov, S.: 3D Generation on ImageNet. arXiv preprint arXiv:2303.01416 (2023)

  67. [67]

    Sungatullina, D., Zakharov, E., Ulyanov, D., Lempitsky, V.: Image Manipulation with Perceptual Discriminators. In: Proceedings of the European Conference on Computer Vision (ECCV) (September 2018)

  68. [68]

    Szymanowicz, S., Rupprecht, C., Vedaldi, A.: Splatter Image: Ultra-Fast Single-View 3D Reconstruction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10208–10217 (2024)

  69. [69]

    Tang, J., Davoli, D., Kirschstein, T., Schoneveld, L., Nießner, M.: GAF: Gaussian Avatar Reconstruction from Monocular Videos via Multi-view Diffusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5546–5558 (2025)

  70. [70]

    Teotia, K., Kim, H., Garrido, P., Habermann, M., Elgharib, M., Theobalt, C.: GaussianHeads: End-to-End Learning of Drivable Gaussian Head Avatars from Coarse-to-fine Representations. ACM Transactions on Graphics (SIGGRAPH Asia) 43(6) (2024)

  71. [71]

    Teotia, K., Rhodin, H., Mendiratta, M., Kim, H., Habermann, M., Theobalt, C.: Audio Driven Universal Gaussian Head Avatars. In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers. pp. 1–12 (2025)

  72. [72]

    Tran, P., Zakharov, E., Ho, L.N., Tran, A.T., Hu, L., Li, H.: VOODOO 3D: Volumetric Portrait Disentanglement for One-Shot 3D Head Reenactment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

  73. [73]

    Umeyama, S.: Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 13(4), 376–380 (1991). https://doi.org/10.1109/34.88573

  74. [74]

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention Is All You Need. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems. vol. 30. Curran Associates, Inc. (2017), https://proceedings.neurips.c...

  75. [75]

    Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: VGGT: Visual Geometry Grounded Transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)

  76. [76]

    Wang, S., Leroy, V., Cabon, Y., Chidlovskii, B., Revaud, J.: DUSt3R: Geometric 3D Vision Made Easy. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 20697–20709 (2024)

  77. [77]

    Xiang, J., Gao, X., Guo, Y., Zhang, J.: FlashAvatar: High-Fidelity Head Avatar with Efficient Gaussian Embedding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1802–1812 (2024)

  78. [78]

    Xu, Y., Shi, Z., Yifan, W., Chen, H., Yang, C., Peng, S., Shen, Y., Wetzstein, G.: GRM: Large Gaussian Reconstruction Model for Efficient 3D Reconstruction and Generation. In: European Conference on Computer Vision (ECCV). pp. 1–20. Springer (2024)

  79. [79]

    Xue, Y., Xie, X., Marin, R., Pons-Moll, G.: Human-3Diffusion: Realistic Avatar Creation via Explicit 3D Consistent Diffusion Models. Advances in Neural Information Processing Systems 37, 99601–99645 (2024)

  80. [80]

    Yang, J., Wu, T., Fogarty, K., Zhong, F., Oztireli, C.: PSHead: 3D Head Reconstruction from a Single Image with Diffusion Prior and Self-Enhancement. In: Computer Graphics Forum. p. e70279. Wiley Online Library (2025)
