pith. machine review for the scientific record.

arxiv: 2605.01450 · v1 · submitted 2026-05-02 · 💻 cs.CV

Recognition: unknown

Registration-Free Learnable Multi-View Capture of Faces in Dense Semantic Correspondence

Authors on Pith · no claims yet

Pith reviewed 2026-05-09 15:04 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D face reconstruction · multi-view images · dense correspondence · registration-free · inverse kinematics · pointmap losses · test-time optimization

The pith

MOCHI reconstructs 3D face meshes in dense semantic correspondence from multi-view images without any registered training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MOCHI as a framework that predicts 3D face meshes directly from calibrated multi-view images while eliminating the need for registered training data that prior learning methods still required. It achieves this by enforcing mesh consistency through a pseudo-linear inverse kinematic solver and by deriving semantic guidance from dense 2D keypoints produced by a landmark predictor trained only on synthetic data. The authors also replace point-to-surface distances with pointmap- and normal-based losses to stabilize training and introduce a short test-time optimization pass that refines network weights. If the approach holds, high-quality dense 3D face capture becomes feasible using ordinary unregistered image sets rather than labor-intensive manual pipelines.
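To make the loss substitution concrete, here is a minimal PyTorch-style sketch of the two loss families, written from the paper's description rather than its code; the tensor shapes, the nearest-point proxy for point-to-surface distance, and all function names are assumptions for illustration.

```python
import torch

def point_to_surface_loss(pred_verts, scan_points):
    """Naive point-to-surface proxy: distance from each predicted vertex
    to its nearest scan point. Gradients follow whichever scan point is
    currently nearest, and that assignment can switch abruptly between
    iterations, the instability the paper attributes to this loss family."""
    # pred_verts: (V, 3), scan_points: (P, 3)
    d = torch.cdist(pred_verts, scan_points)   # (V, P) pairwise distances
    return d.min(dim=1).values.mean()

def pointmap_loss(pred_pointmap, gt_pointmap, valid_mask):
    """Pointmap supervision: compare per-pixel 3D coordinates rendered
    from the predicted mesh against a reference pointmap, so every valid
    pixel gets a fixed, smooth target."""
    # pred_pointmap, gt_pointmap: (H, W, 3); valid_mask: (H, W) bool
    diff = (pred_pointmap - gt_pointmap)[valid_mask]
    return diff.norm(dim=-1).mean()

def normal_loss(pred_normals, gt_normals, valid_mask):
    """Normal term: penalize angular deviation via cosine similarity."""
    cos = torch.nn.functional.cosine_similarity(
        pred_normals[valid_mask], gt_normals[valid_mask], dim=-1)
    return (1.0 - cos).mean()
```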

Core claim

MOCHI eliminates the registration data dependency by enforcing topological consistency through a pseudo-linear inverse kinematic solver. Semantic alignment is guided by dense keypoints from a 2D landmark predictor trained exclusively on synthetic data. Standard point-to-surface distances are shown to induce training instabilities in registration-free settings, so pointmap- and normal-based losses are used instead. A test-time optimization scheme then refines the network over a few dozen iterations, allowing the method to surpass traditional registration pipelines in both reconstruction accuracy and visual quality.
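The test-time optimization is described only as refining network weights for a few dozen iterations against the target capture. A hedged sketch of what such a loop could look like, reusing the loss helpers sketched above; `model`, `render_fn`, the optimizer choice, and every hyperparameter here are illustrative assumptions, not the paper's settings.

```python
import copy
import torch

def test_time_optimize(model, views, scan_pointmap, scan_normals, valid_mask,
                       render_fn, iters=50, lr=1e-4, w_normal=0.1):
    """Refine a copy of the trained network on a single capture.
    `render_fn` is an assumed differentiable renderer mapping predicted
    vertices to a (pointmap, normal map) pair in the scan's camera."""
    tto_model = copy.deepcopy(model)            # leave the base model intact
    opt = torch.optim.AdamW(tto_model.parameters(), lr=lr)
    for _ in range(iters):                      # "a few dozen iterations"
        verts = tto_model(views)                # feed-forward mesh prediction
        pm, nm = render_fn(verts)
        loss = (pointmap_loss(pm, scan_pointmap, valid_mask)
                + w_normal * normal_loss(nm, scan_normals, valid_mask))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return tto_model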

What carries the argument

The pseudo-linear inverse kinematic solver that enforces topological consistency on predicted meshes, guided by dense keypoints from a synthetic-data-trained 2D landmark model, together with pointmap- and normal-based losses that supply stable gradients.

Load-bearing premise

Dense keypoints predicted by a 2D landmark model trained only on synthetic data must provide reliable semantic guidance for real multi-view face images, and the pseudo-linear inverse kinematic solver must enforce sufficient topological consistency without introducing new artifacts.
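Why a pseudo-linear solver can enforce consistency is easiest to see on a purely linear shape model: recover the model parameters from a free-form vertex prediction by least squares, then re-decode through the basis so the result lies on the model's shape space. The sketch below makes that linearity assumption explicit and omits the pose linearization that PLIKS [70] actually performs; all names and shapes are illustrative.

```python
import torch

def recover_parameters(pred_verts, mean_verts, basis):
    """Least-squares inverse of a linear shape model.
    pred_verts: (V, 3) free-form network prediction
    mean_verts: (V, 3) model mean shape
    basis:      (3V, K) linear shape/expression basis
    Solves min_beta || basis @ beta - (pred - mean) ||^2 in closed form."""
    residual = (pred_verts - mean_verts).reshape(-1, 1)   # (3V, 1)
    beta = torch.linalg.lstsq(basis, residual).solution   # (K, 1)
    return beta.squeeze(-1)

def project_to_model(pred_verts, mean_verts, basis):
    """Re-decode the recovered parameters: the output is a point in the
    model's span, which suppresses the self-intersecting and noisy
    triangles that free-form predictions show around eyes, nose, and
    ears (cf. Figure 6)."""
    beta = recover_parameters(pred_verts, mean_verts, basis)
    offsets = (basis @ beta).reshape(-1, 3)
    return mean_verts + offsets
```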

What would settle it

Running MOCHI on a held-out set of real unregistered multi-view face images and observing either frequent topological inconsistencies in the output meshes or semantic misalignments that deviate from ground-truth registered models would disprove the central claims.
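Both failure signals are directly measurable because every prediction shares the FLAME topology: semantic misalignment reduces to per-vertex distance against the ground-truth registration, and geometric error to the scan-to-mesh distance behind the paper's heatmaps (red ≥ 1.0 mm). A sketch of such a check, where the nearest-vertex proxy, the millimeter units, and every threshold are illustrative assumptions:

```python
import torch

def semantic_misalignment(pred_verts, gt_reg_verts):
    """Per-vertex error against a GT registration in the same topology:
    vertex i of both meshes names the same semantic point, so plain
    Euclidean distance measures correspondence quality, not just shape."""
    return (pred_verts - gt_reg_verts).norm(dim=-1)     # (V,)

def scan_error(scan_points, pred_verts):
    """Proxy for the paper's point-to-surface heatmap: distance from each
    scan point to the nearest predicted vertex (a true point-to-triangle
    distance would be tighter but needs the face list)."""
    return torch.cdist(scan_points, pred_verts).min(dim=1).values

def flag_failure(scan_points, pred_verts, thresh_mm=1.0, frac=0.2):
    """Illustrative pass/fail: flag a capture whose scan error exceeds
    1.0 mm on a large fraction of points, assuming millimeter units."""
    err = scan_error(scan_points, pred_verts)
    return (err > thresh_mm).float().mean() > frac
```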

Figures

Figures reproduced from arXiv: 2605.01450 by George Retsinas, Panagiotis P. Filntisis, Petros Maragos, Radek Daněček, Timo Bolkart, Vanessa Sklyarova.

Figure 1. MOCHI: Registration-free multi-view face capture in dense correspondence. Given calibrated multi-view images, MOCHI predicts a canonical FLAME-topology mesh without needing precomputed registrations during training. When a target scan is available at test time, an optional brief TTO of the model sharpens details and outperforms classical, manually labor-heavy registration pipelines.

Figure 2. Traditional capturing systems (right) rely on a multi-stage approach involving MVS to produce a head scan, followed by registration that requires manual supervision. Recent learning-based systems (left) simplify the workflow but still require GT registrations from traditional systems for training (red arrow). MOCHI relies only on the fully automatic parts (green arrow), outperforming previous learning-based systems.

Figure 3. Given calibrated multi-view images, MOCHI predicts the vertex positions…

Figure 4. Qualitative results on FaMoS (8 views). From left to right: TEMPEH [9], MOCHI, MOCHI + TTO (50 iters), Classic Registration, and the Scan. For each method we show the predicted mesh (left) and the point-to-surface error heatmap on the scan (right; red ≥ 1.0 mm, lower is better). MOCHI improves over TEMPEH on unseen subjects even without using registrations for training; adding TTO further reduces error.

Figure 5. Effect of loss type/weight. Training the coarse model with point maps (top) and point-to-surface loss (bottom). When starting from a very coarse prediction using only landmark pretraining, increasing the point-to-surface loss weight leads to artifacts, whereas point-map supervision remains stable. Blue numbers show the point-to-surface error (mm).

Figure 6. Effect of inverse parameter recovery. Left: only 2D landmark regularization. Right: 2D landmarks + PLIKS. While 2D landmarks alone can approximate the facial topology, sensitive areas such as eyes, nose, and ears exhibit artifacts and noisy triangles. Adding PLIKS enforces anatomical consistency.

Figure 7. TTO ablation on the FaMoS validation set. We ablate (1) the number of TTO steps, (2) training without normals, and (3) training with the point-to-surface loss. Best results are obtained when using both point-map and normal losses. The black star shows classic registration metrics. Full table in Sup. Mat.

Figure 9. Impact of hair caps on ground-truth scans. We show example input images and their corresponding raw scans from the FaMoS dataset. Depending on the fit of the hair cap, the resulting scans exhibit significant distortion in the scalp region, often including the geometry of the cap itself. As a result, this region does not faithfully represent the subject's actual scalp and is excluded from our quantitative evaluation.

Figure 8. Multi-view input setup. Example views from the 8 RGB cameras used in the FaMoS dataset. We utilize only these RGB streams for both MOCHI and the TEMPEH baseline.

Figure 11. Qualitative results on FaMoS validation and test sets (we show 4 of 8 input views). From left to right: TEMPEH [9], MOCHI, MOCHI + TTO (50 iters), Classic Registration, and the Scan. For each method we show the predicted mesh (left) and the point-to-surface error heatmap on the scan (right; red ≥ 1.0 mm, lower is better). MOCHI improves over TEMPEH on unseen subjects even without using registrations for training.

Figure 12. Robustness of optimization strategies. We compare our latent-space optimization (MOCHI TTO) against directly optimizing mesh vertices (Direct Opt) at 50 and 100 iterations. While Direct Opt reduces the numerical point-to-surface error, it is prone to artifacts, spikes, and surface irregularities (highlighted by red boxes). In contrast, MOCHI TTO leverages the network's learned prior.

Figure 13. Qualitative progression of test-time optimization, showing the evolution of mesh refinement over 100 iterations on example images from the FaMoS test set.

Figure 1. Example results on FaceScape: MOCHI, MOCHI TTO, and the official registration.

Figure 14. Impact of loss type and weight. We visualize the training of the coarse model using point maps (top), point-to-surface loss (middle), and Chamfer loss (bottom). When starting from a coarse prediction initialized only with landmark pretraining, increasing the point-to-surface loss weight leads to severe artifacts. While Chamfer distance results in fewer artifacts, it can still introduce errors.

Figure 15. Failure cases. Top row (MOCHI TTO): since the FLAME topology lacks explicit teeth geometry, test-time optimization may incorrectly distort the lips by pulling them inward to minimize the point-to-surface distance to the teeth present in the raw scan. Bottom row (MOCHI): in cases of large head poses, the base model may occasionally present artifacts.

Figure 17. Synthetic data. From left to right: path-traced RGB image, normal map, uv-map, binary face mask. We encourage the reader to zoom in on this figure in its digital form.

Figure 18. Qualitative evaluation of the 2D landmark tracker on the FaMoS dataset. From left to right: the input RGB image, the ground-truth landmarks (obtained by projecting the available registration), and the landmarks predicted by our tracker. The vertices are color-coded by semantic region (e.g., green/red for eyes, pink for lips, blue for ears). Despite being trained exclusively on synthetic data, the tracker…
original abstract

Recent frameworks like ToFu and TEMPEH provide an automated alternative to classical registration pipelines by predicting 3D meshes in dense semantic correspondence directly from calibrated multi-view images. However, these learning-based methods rely on the slow, manual registration pipelines they aim to replace for their training supervision. We overcome this limitation with MOCHI (Multi-view Optimizable Correspondence of Heads from Images), a multi-view 3D face prediction framework trained without requiring registered training data. MOCHI eliminates the registration data dependency by enforcing topological consistency through a pseudo-linear inverse kinematic solver. Semantic alignment is guided by dense keypoints from a 2D landmark predictor trained exclusively on synthetic data. Our analysis further reveals that standard point-to-surface distances induce training instabilities and visual artifacts in registration-free settings. We propose pointmap- and normal-based losses instead, which provide smoother gradients and superior reconstruction fidelity. Finally, we introduce a test-time optimization scheme that refines network weights over a few dozen iterations. This approach bridges the gap between feed-forward efficiency and iterative optimization precision, allowing MOCHI to outperform traditional labor-intensive pipelines in both reconstruction accuracy and visual quality. Code and model are public at: https://filby89.github.io/mochi.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces MOCHI, a multi-view 3D face mesh prediction framework that achieves dense semantic correspondence without registered training data. It enforces topological consistency via a pseudo-linear inverse kinematic solver, guides semantic alignment using dense keypoints from a 2D landmark predictor trained exclusively on synthetic data, replaces unstable point-to-surface losses with pointmap- and normal-based losses, and adds a test-time optimization scheme over a few dozen iterations. The authors claim this allows MOCHI to outperform traditional labor-intensive registration pipelines in reconstruction accuracy and visual quality, with public code and models released.

Significance. If the central claims hold—particularly effective synthetic-to-real transfer for keypoints and stable training via the new losses and solver—MOCHI would be a meaningful advance in 3D face capture by removing the registration data bottleneck that limits scalability of prior learning-based methods like ToFu and TEMPEH. The public code release supports reproducibility and is a clear strength.

major comments (2)
  1. [Abstract] The registration-free claim rests on the 2D landmark predictor (trained only on synthetic data) supplying reliable dense semantic guidance for real multi-view images. No quantitative keypoint error metrics on real data, domain-shift analysis, or ablation removing the synthetic predictor are referenced in the abstract, leaving the superiority over registered pipelines conditional on an unverified generalization assumption.
  2. [Abstract] The pseudo-linear inverse kinematic solver is presented as the mechanism for enforcing topological consistency without registration, yet the abstract provides no equations, algorithm, or comparison to standard IK methods. This makes it impossible to assess whether it avoids new artifacts or simply shifts instabilities, which is load-bearing for the central contribution.
minor comments (1)
  1. [Abstract] The abstract states that 'standard point-to-surface distances induce training instabilities' and proposes alternatives, but does not indicate in which section the supporting analysis or gradient visualizations appear.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the abstract accordingly to better substantiate the key claims while respecting length constraints.

point-by-point responses
  1. Referee: [Abstract] The registration-free claim rests on the 2D landmark predictor (trained only on synthetic data) supplying reliable dense semantic guidance for real multi-view images. No quantitative keypoint error metrics on real data, domain-shift analysis, or ablation removing the synthetic predictor are referenced in the abstract, leaving the superiority over registered pipelines conditional on an unverified generalization assumption.

    Authors: We agree that the abstract would benefit from explicitly referencing the supporting evidence for the synthetic-to-real transfer. The full manuscript contains quantitative keypoint error metrics on real data, domain-shift analysis, and ablations that remove the synthetic predictor to demonstrate its contribution. We will revise the abstract to include a concise reference to these results, thereby strengthening the registration-free claim without altering the manuscript's technical content. revision: yes

  2. Referee: [Abstract] The pseudo-linear inverse kinematic solver is presented as the mechanism for enforcing topological consistency without registration, yet the abstract provides no equations, algorithm, or comparison to standard IK methods. This makes it impossible to assess whether it avoids new artifacts or simply shifts instabilities, which is load-bearing for the central contribution.

    Authors: The abstract serves as a high-level overview, while the methods section provides the full equations, algorithmic details, and comparisons to standard IK solvers that demonstrate how the pseudo-linear formulation avoids instabilities and artifacts. To address the concern, we will update the abstract with a brief phrase highlighting the solver's stability properties and its role in registration-free training. revision: yes

Circularity Check

0 steps flagged

No significant circularity; the derivation relies on external components and newly proposed elements.

full rationale

The paper's core derivation introduces MOCHI as a registration-free framework that enforces topological consistency via a newly described pseudo-linear inverse kinematic solver and guides semantic alignment using dense keypoints from a separately trained 2D landmark predictor on synthetic data only. It further proposes pointmap- and normal-based losses to replace point-to-surface distances and adds test-time optimization. No self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or described chain; the landmark predictor and solver are presented as independent inputs rather than derived from the target multi-view data. The central performance claims rest on these external and novel elements rather than reducing to the inputs by construction, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the effectiveness of the newly introduced pseudo-linear IK solver and the generalization of the synthetic-trained 2D predictor; these are not derived from first principles but postulated as working components.

axioms (1)
  • domain assumption Dense keypoints from a 2D landmark predictor trained exclusively on synthetic data provide sufficient semantic alignment for real multi-view face images.
    This assumption is invoked to replace the need for registered real data.
invented entities (1)
  • pseudo-linear inverse kinematic solver no independent evidence
    purpose: Enforce topological consistency in predicted 3D face meshes without registration.
    New component introduced to solve the registration dependency.

pith-pipeline@v0.9.0 · 5542 in / 1511 out tokens · 70977 ms · 2026-05-09T15:04:06.937190+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

96 extracted references · 8 canonical work pages · 1 internal anchor

  1. [1] Brian Amberg, Reinhard Knothe, and Thomas Vetter. Expression invariant 3D face recognition with a morphable model. In 2008 8th IEEE International Conference on Automatic Face & Gesture Recognition, pages 1–6, 2008.
  2. [2] Shivangi Aneja, Sebastian Weiss, Irene Baeza, Prashanth Chandran, Gaspard Zoss, Matthias Nießner, and Derek Bradley. ScaffoldAvatar: High-fidelity Gaussian avatars with patch expressions. CoRR, abs/2507.10542, 2025.
  3. [3] Mehdi Bahri, Eimear O'Sullivan, Shunwang Gong, Feng Liu, Xiaoming Liu, Michael M. Bronstein, and Stefanos Zafeiriou. Shape my face: Registering 3D face scans by surface-to-surface translation. Int. J. Comput. Vision, 129(9):2680–2713, 2021.
  4. [4] Thabo Beeler, Bernd Bickel, Paul Beardsley, Bob Sumner, and Markus Gross. High-quality single-shot capture of facial geometry. ACM Trans. Graph., 29(4), 2010.
  5. [5] Thabo Beeler, Fabian Hahn, Derek Bradley, Bernd Bickel, Paul Beardsley, Craig Gotsman, Robert W. Sumner, and Markus Gross. High-quality passive facial performance capture using anchor frames. ACM Trans. Graph., 30(4), 2011.
  6. [6] Paul J. Besl and Neil D. McKay. A method for registration of 3-D shapes. IEEE Trans. Pattern Anal. Mach. Intell., 14(2):239–256, 1992.
  7. [7] Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3D faces. In Proceedings of SIGGRAPH, 1999.
  8. [8] Volker Blanz, Sami Romdhani, and Thomas Vetter. Face identification across different poses and illuminations with a 3D morphable model. In IEEE International Conference on Automatic Face and Gesture Recognition, pages 202–207.
  9. [9] Timo Bolkart, Tianye Li, and Michael J. Black. Instant multi-view head capture through learnable registration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  10. [10] James Booth, Anastasios Roussos, Stefanos Zafeiriou, Allan Ponniah, and David J. Dunaway. A 3D morphable model learnt from 10,000 faces. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, June 27-30, 2016, pages 5543–5552. IEEE Computer Society, 2016.
  11. [11] James Booth, Anastasios Roussos, Arun Ponniah, David Dunaway, and Stefanos Zafeiriou. 3D morphable models: Past, present, and future. arXiv preprint arXiv:1803.09725.
  12. [12] James Booth, Anastasios Roussos, Allan Ponniah, David J. Dunaway, and Stefanos Zafeiriou. Large scale 3D morphable models. Int. J. Comput. Vis., 126(2-4):233–254, 2018.
  13. [13] Derek Bradley, Wolfgang Heidrich, Tiberiu Popa, and Alla Sheffer. High resolution passive facial performance capture. ACM Trans. Graph., 29(4), 2010.
  14. [14] Chen Cao, Yanlin Weng, Steve Lin Zhou, Yiying Tong, and Kun Zhou. Real-time 3D face tracking in the wild with a single RGB camera. In ACM Transactions on Graphics (TOG).
  15. [15] Chen Cao, Yanlin Weng, Shun Zhou, Yiying Tong, and Kun Zhou. FaceWarehouse: A 3D facial expression database for visual computing. IEEE TVCG, 20(3):413–425, 2014.
  16. [16] Blender Online Community. Blender - a 3D modelling and rendering package. Blender Foundation, Stichting Blender Foundation, Amsterdam, 2018.
  17. [17] Radek Daněček, Michael J. Black, and Timo Bolkart. EMOCA: Emotion driven monocular face capture and animation. In CVPR, pages 20279–20290, 2022.
  18. [18] Jiankang Deng, Jia Guo, Evangelos Ververas, Irene Kotsia, and Stefanos Zafeiriou. RetinaFace: Single-shot multi-level face localisation in the wild. In CVPR, pages 5202–5211.
  19. [19] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021.
  20. [20] Pengfei Dou, Shishir K. Shah, and Ioannis A. Kakadiaris. End-to-end 3D face reconstruction with deep neural networks. In CVPR, pages 5908–5917, 2017.
  21. [21] Bernhard Egger, William A. P. Smith, Ayush Tewari, Stefanie Wuhrer, Michael Zollhöfer, Thabo Beeler, Florian Bernard, Timo Bolkart, Adam Kortylewski, Sami Romdhani, Christian Theobalt, Volker Blanz, and Thomas Vetter. 3D morphable face models - past, present, and future. ACM Trans. Graph., 39(5):157:1–157:38, 2020.
  22. [22] Carlos Hernández Esteban and Francis Schmitt. Silhouette and stereo fusion for 3D object modeling. Computer Vision and Image Understanding, 96(3):367–392, 2004.
  23. [23] Yao Feng, Fan Wu, Xiaohu Shao, Yanfeng Wang, and Xi Zhou. Joint 3D face reconstruction and dense alignment with position map regression network. In ECCV, pages 534–551.
  24. [24] Yao Feng, Haiwen Feng, Michael J. Black, and Timo Bolkart. Learning an animatable detailed 3D face model from in-the-wild images. ACM TOG, 40(4):88:1–88:13, 2021.
  25. [25] Panagiotis P. Filntisis, George Retsinas, Foivos Paraperas-Papantoniou, Athanasios Katsamanis, Anastasios Roussos, and Petros Maragos. Visual speech-aware perceptual 3D facial expression reconstruction from videos. arXiv preprint arXiv:2207.11094, 2022.
  26. [26] Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multi-view stereopsis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
  27. [27] Graham Fyffe, Koki Nagano, Loc Huynh, Shunsuke Saito, Jay Busch, Andrew Jones, Hao Li, and Paul Debevec. Multi-view stereo on consistent face topology. Comput. Graph. Forum, 36(2):295–309, 2017.
  28. [28] Xuan Gao, Chenglai Zhong, Jun Xiang, Yang Hong, Yudong Guo, and Juyong Zhang. Reconstructing personalized semantic facial NeRF models from monocular video. ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia), 41(6), 2022.
  29. [29] Stuart Geman. Statistical methods for tomographic image restoration. Bull. Internat. Statist. Inst., 52:5–21, 1987.
  30. [30] Thomas Gerig, Andreas Morel-Forster, Clemens Blumer, Bernhard Egger, Marcel Luthi, Sandro Schönborn, and Thomas Vetter. Morphable face models - an open framework. In IEEE International Conference on Automatic Face and Gesture Recognition, pages 75–82, 2018.
  31. [31] Abhijeet Ghosh, Graham Fyffe, Ben Tunwattanapong, Jay Busch, Xueming Yu, and Paul Debevec. Multiview face capture using polarized spherical gradient illumination. In ACM Transactions on Graphics (TOG), 2011.
  32. [32] Simon Giebenhain, Tobias Kirschstein, Markos Georgopoulos, Martin Rünz, Lourdes Agapito, and Matthias Nießner. Learning neural parametric head models. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).
  33. [33] Simon Giebenhain, Tobias Kirschstein, Martin Rünz, Lourdes Agapito, and Matthias Nießner. NPGA: Neural parametric Gaussian avatars. In SIGGRAPH Asia 2024 Conference Papers (SA Conference Papers '24), December 3-6, Tokyo, Japan, 2024.
  34. [34] M. Goesele, B. Curless, and S. M. Seitz. Multi-view stereo revisited. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), pages 2402–2409, 2006.
  35. [35] Riza Alp Güler, George Trigeorgis, Epameinondas Antonakos, Patrick Snape, Stefanos Zafeiriou, and Iasonas Kokkinos. DenseReg: Fully convolutional dense shape regression in-the-wild. In CVPR, pages 6799–6808, 2017.
  36. [36] Jianzhu Guo, Xiangyu Zhu, Yang Yang, Fan Yang, Zhen Lei, and Stan Z. Li. Towards fast, accurate and stable 3D dense face alignment. In ECCV, pages 152–168, 2020.
  37. [37] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR).
  38. [38] Sunghoon Im, Hae-Gon Jeon, Stephen Lin, and In So Kweon. DPSNet: End-to-end deep plane sweep stereo. In International Conference on Learning Representations (ICLR).
  39. [39] Aaron S. Jackson, Adrian Bulat, Vasileios Argyriou, and Georgios Tzimiropoulos. Large pose 3D face reconstruction from a single image via direct volumetric CNN regression. In ICCV, pages 1031–1039, 2017.
  40. [40] Krishna Murthy Jatavallabhula, Edward Smith, Jean-Francois Lafleche, Clement Fuji Tsang, Artem Rozantsev, Wenzheng Chen, Tommy Xiang, Rev Lebaredian, and Sanja Fidler. Kaolin: A PyTorch library for accelerating 3D deep learning research. arXiv preprint arXiv:1911.05063, 2019.
  41. [41] Harim Jung, Myeong-Seok Oh, and Seong-Whan Lee. Learning free-form deformation for 3D face reconstruction from in-the-wild images. In IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 2737–2742.
  42. [42] Abhishek Kar, Christian Häne, and Jitendra Malik. Learning a multi-view stereo machine. In Advances in Neural Information Processing Systems (NeurIPS), pages 365–376.
  43. [43] Vladimir Kolmogorov and Ramin Zabih. Multi-camera scene reconstruction via graph cuts. In European Conference on Computer Vision (ECCV), pages 82–96. Springer, 2002.
  44. [44] Marc Levoy, Kari Pulli, Brian Curless, Szymon Rusinkiewicz, David Koller, Lucas Pereira, Matt Ginzton, Sean Anderson, James Davis, Jeremy Ginsberg, Jonathan Shade, and Duane Fulk. The digital Michelangelo project: 3D scanning of large statues. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, pages 131–144, USA.
  45. [45] Hao Li, Bart Adams, Leonidas J. Guibas, and Mark Pauly. Robust single-view geometry and motion reconstruction. ACM Transactions on Graphics (Proceedings SIGGRAPH Asia 2009), 28(5), 2009.
  46. [46] Jing Li, Di Kang, and Zhenyu He. GRAPE: Generalizable and robust multi-view facial capture. arXiv preprint arXiv:2407.10193, 2024.
  47. [47] Ruilong Li, Karl Bladin, Yajie Zhao, Chinmay Chinara, Owen Ingraham, Pengda Xiang, Xinglei Ren, Pratusha Prasad, Bipin Kishore, Jun Xing, and Hao Li. Learning formation of physically-based face attributes. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, June 13-19, 2020, pages 3407–3416.
  48. [48] Tianye Li, Timo Bolkart, Michael J. Black, Hao Li, and Javier Romero. Learning a model of facial shape and expression from 4D scans. In ACM SIGGRAPH Asia 2017, 2017.
  49. [49] Tianye Li, Shichen Liu, Timo Bolkart, Jiayi Liu, Hao Li, and Yajie Zhao. Topologically consistent multi-view face inference using volumetric sampling. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
  50. [50] Feng Liu, Luan Tran, and Xiaoming Liu. 3D face modeling from diverse raw scan data. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9407–9417.
  51. [51] Shichen Liu, Yunxuan Cai, Haiwei Chen, Yichao Zhou, and Yajie Zhao. Rapid face asset acquisition with recurrent feature alignment. ACM TOG, 41(6):214:1–214:17, 2022.
  52. [52] Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh. Neural volumes: Learning dynamic renderable volumes from images. ACM Trans. Graph., 38(4):65:1–65:14, 2019.
  53. [53] Stephen Lombardi, Tomas Simon, Gabriel Schwartz, Michael Zollhoefer, Yaser Sheikh, and Jason Saragih. Mixture of volumetric primitives for efficient neural rendering. ACM Trans. Graph., 40(4), 2021.
  54. [54] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019.
  55. [55] Shugao Ma, Tomas Simon, Jason M. Saragih, Dawei Wang, Yuecheng Li, Fernando De la Torre, and Yaser Sheikh. Pixel codec avatars. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 64–73, 2021.
  56. [56] Wan-Chun Ma, Tim Hawkins, Pieter Peers, Charles-Felix Chabert, Malte Weiss, and Paul Debevec. Rapid acquisition of specular and diffuse normal maps from polarized spherical gradient illumination. In Eurographics Conference on Rendering Techniques, pages 183–194, Goslar, DEU, 2007. Eurographics Association.
  57. [57] Jeong Joon Park, Peter Florence, Julian Straub, Richard A. Newcombe, and Steven Lovegrove. DeepSDF: Learning continuous signed distance functions for shape representation. In CVPR, pages 165–174, 2019.
  58. [58] Georgios Passalis, Panagiotis Perakis, Theoharis Theoharis, and Ioannis A. Kakadiaris. Using facial symmetry to handle pose variations in real-world 3D face recognition. IEEE Trans. Pattern Anal. Mach. Intell., 33(10):1938–1951, 2011.
  59. [59] Pascal Paysan, Reinhard Knothe, Brian Amberg, Sami Romdhani, and Thomas Vetter. A 3D face model for pose and illumination invariant face recognition. In IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2009.
  60. [60] Stylianos Ploumpis, Evangelos Ververas, Eimear O'Sullivan, Stylianos Moschoglou, Haoyang Wang, Nick E. Pears, William A. P. Smith, Baris Gecer, and Stefanos Zafeiriou. Towards a complete 3D morphable model of the human head. IEEE TPAMI, 43(11):4142–4160, 2021.
  61. [61] Anurag Ranjan, Timo Bolkart, Soubhik Sanyal, and Michael J. Black. Generating 3D faces using convolutional mesh autoencoders. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part III, pages 725–741. Springer, 2018.
  62. [62] George Retsinas, Panagiotis P. Filntisis, Radek Daněček, Victoria F. Abrevaya, Anastasios Roussos, Timo Bolkart, and Petros Maragos. 3D facial expressions through analysis-by-neural-synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  63. [63] Jeremy Riviere, Paulo Gotardo, Derek Bradley, Abhijeet Ghosh, and Thabo Beeler. Single-shot high-quality facial geometry and skin appearance capture. ACM Transactions on Graphics, 39, 2020.
  64. [64] Zeyu Ruan, Changqing Zou, Longhai Wu, Gangshan Wu, and Limin Wang. SADRNet: Self-aligned dual face regression networks for robust 3D dense face alignment and reconstruction. Transactions on Image Processing, 30:5793–5806.
  65. [65] Augusto Salazar, Stefanie Wuhrer, Chang Shu, and Flavio Augusto Prieto. Fully automatic expression-invariant face correspondence. Machine Vision and Applications, 25:859–879, 2013.
  66. [66] Soubhik Sanyal, Timo Bolkart, Haiwen Feng, and Michael Black. Learning to regress 3D face shape and expression from an image without 3D supervision. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2019.
  67. [67] Johannes L. Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  68. [68] Matan Sela, Elad Richardson, and Ron Kimmel. Unrestricted facial geometry reconstruction using image-to-image translation. In ICCV, pages 1576–1585, 2017.
  69. [69] Jiaxiang Shang, Tianwei Shen, Shiwei Li, Lei Zhou, Mingmin Zhen, Tian Fang, and Long Quan. Self-supervised monocular 3D face reconstruction by occlusion-aware multi-view geometry consistency. In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XV, pages 53–70. Springer, 2020.
  70. [70] Karthik Shetty, Annette Birkhold, Srikrishna Jaganathan, Norbert Strobel, Markus Kowarschik, and Bernhard Egger. PLIKS: A pseudo-linear inverse kinematic solver for 3D human body estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  71. [71] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, …
  72. [72] Vanessa Sklyarova, Egor Zakharov, Otmar Hilliges, Michael J. Black, and Justus Thies. Text-conditioned generative model of 3D strand-based human hairstyles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4703–4712, 2024.
  73. [73] Attila Szabó, Givi Meishvili, and Paolo Favaro. Unsupervised generative 3D shape learning from natural images. CoRR, abs/1910.00287, 2019.
  74. [74] Felix Taubner, Prashant Raina, Mathieu Tuli, Eu Wern Teh, Chul Lee, and Jinmiao Huang. 3D face tracking from 2D video through iterative dense UV to image flow. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1227–1237, 2024.
  75. [75] Ayush Tewari, Michael Zollhöfer, Hyeongwoo Kim, Pablo Garrido, Florian Bernard, Patrick Pérez, and Christian Theobalt. MoFA: Model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In IEEE International Conference on Computer Vision (ICCV), Venice, Italy, October 22-29, 2017, pages 3735–3744. IEEE Computer Society, 2017.
  76. [76] Ayush Tewari, Michael Zollhöfer, Pablo Garrido, Florian Bernard, Hyeongwoo Kim, Patrick Pérez, and Christian Theobalt. Self-supervised multi-level face model learning for monocular reconstruction at over 250 Hz. In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, June 18-22, 2018, pages 2549–2559.
  77. [77] A. Tewari, O. Fried, J. Thies, V. Sitzmann, S. Lombardi, K. Sunkavalli, R. Martin-Brualla, T. Simon, J. Saragih, M. Nießner, R. Pandey, S. Fanello, G. Wetzstein, J.-Y. Zhu, C. Theobalt, M. Agrawala, E. Shechtman, D. B. Goldman, and M. Zollhöfer. State of the art on neural rendering. EG, 2020.
  78. [78] J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, and M. Nießner. Face2Face: Real-time face capture and reenactment of RGB videos. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, pages 2387–2395, 2016.
  79. [79] T. Vetter and V. Blanz. Estimating coloured 3D face models from single images: An example based approach. In ECCV.
  80. [80] George Vogiatzis, Philip H. S. Torr, and Roberto Cipolla. Multi-view stereo via volumetric graph-cuts. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 391–398. IEEE, 2005.

Showing first 80 references.