pith. machine review for the scientific record.

arxiv: 2604.02320 · v2 · submitted 2026-04-02 · 💻 cs.CV · cs.GR

Recognition: no theorem link

Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 21:29 UTC · model grok-4.3

classification 💻 cs.CV cs.GR
keywords 3D avatar modeling · large-scale pretraining · codec avatars · full-body reconstruction · facial animation · generalization · in-the-wild data · post-training

The pith

Pretraining on one million in-the-wild videos followed by post-training on curated data produces high-fidelity full-body 3D avatars that generalize across identities and styles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a pre- and post-training approach for 3D avatar modeling that first builds broad priors from a million everyday videos and then refines them on high-quality data. This addresses the long-standing tension between studio-level fidelity, which usually fails outside controlled settings, and large-scale generalization, which has historically produced low-quality results. The resulting Large-scale Codec Avatars deliver feedforward inference with precise expression and finger control while preserving identity across varied hair, clothing, and demographics. A sympathetic reader would care because the method appears to unlock real-world usability without sacrificing the fine details needed for convincing animation.

Core claim

The authors demonstrate that pretraining a 3D avatar model on 1M in-the-wild videos yields general priors over appearance and geometry; subsequent post-training on high-quality curated data then improves expressivity and fidelity. The resulting LCA model generalizes across hair styles, clothing, and demographics, supplies precise facial expressions and finger-level articulation, and maintains strong identity preservation. It also exhibits emergent capabilities such as relightability and loose-garment support on unconstrained inputs, together with zero-shot robustness to stylized imagery, all without direct supervision for those properties.

What carries the argument

The pre/post-training paradigm for Large-scale Codec Avatars (LCA), in which pretraining on 1M in-the-wild videos supplies broad appearance and geometry priors that post-training on high-quality data then sharpens for expressivity and control.
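As a concrete reading of that paradigm, the sketch below shows the shape of a two-stage recipe. The model class, loss, learning rates, and data stand-ins are all hypothetical; the paper does not disclose these choices, so this illustrates the structure of pretrain-then-post-train, not the authors' implementation.

```python
# Hypothetical two-stage (pretrain / post-train) recipe; all names and
# hyperparameters are illustrative assumptions, not taken from the paper.
import torch
import torch.nn as nn

class AvatarModel(nn.Module):
    """Toy stand-in for the LCA encoder/decoder stack."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 3))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Placeholder output: per-token RGB standing in for Gaussian attributes.
        return self.net(tokens)

def run_stage(model: nn.Module, batches, lr: float) -> None:
    """One training stage: photometric L1 against target renders."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for tokens, target in batches:
        loss = (model(tokens) - target).abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

model = AvatarModel()

# Stage 1: pretrain on large-scale in-the-wild batches (broad but noisy).
pretrain = [(torch.randn(8, 64), torch.rand(8, 3)) for _ in range(4)]
run_stage(model, pretrain, lr=1e-4)

# Stage 2: post-train on curated high-quality captures, here at a lower
# learning rate so broad priors are refined rather than overwritten
# (the rate relationship is an assumption; the paper does not report it).
curated = [(torch.randn(8, 64), torch.rand(8, 3)) for _ in range(4)]
run_stage(model, curated, lr=1e-5)
```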

If this is right

  • Avatars achieve precise fine-grained control over facial expressions and finger articulations.
  • The model generalizes across hair styles, clothing, and demographic groups while preserving identity.
  • Emergent support appears for relighting and loose garments on unconstrained inputs.
  • Zero-shot robustness to stylized imagery occurs without explicit training for that case.
  • Full-body avatars can be generated in a single feedforward pass at inference time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The two-stage recipe suggests that data-scale benefits observed in language and 2D vision models may extend to 3D human modeling.
  • Casual video capture could become sufficient for creating personalized avatars, reducing the need for per-person studio sessions.
  • The feedforward design opens straightforward integration into real-time virtual environments that must handle varied real-world lighting and clothing.

Load-bearing premise

Pretraining on 1M in-the-wild videos produces priors that transfer cleanly to high-fidelity post-training without residual domain-gap artifacts or loss of fine control.

What would settle it

A controlled comparison in which post-trained avatars show measurable drops in finger articulation accuracy or identity preservation on held-out demographic groups relative to a high-quality-only baseline would falsify the transfer claim.
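A hedged sketch of what that settling experiment could look like as an evaluation script. The group labels, metric values, and sample data are placeholders; only the comparison structure, per-group deltas against a high-quality-only baseline, follows from the text above.

```python
# Hypothetical per-group comparison of the pretrained+post-trained model
# against a curated-data-only baseline; all numbers are placeholders.
from collections import defaultdict
from statistics import mean

def per_group_mean(samples):
    """samples: list of (group, finger_err, identity_sim) tuples."""
    groups = defaultdict(list)
    for group, finger_err, identity_sim in samples:
        groups[group].append((finger_err, identity_sim))
    return {g: (mean(x[0] for x in xs), mean(x[1] for x in xs))
            for g, xs in groups.items()}

# Placeholder measurements on a held-out set (would come from real evals).
lca_samples      = [("group_a", 0.021, 0.91), ("group_b", 0.024, 0.90)]
baseline_samples = [("group_a", 0.020, 0.92), ("group_b", 0.019, 0.93)]

lca, base = per_group_mean(lca_samples), per_group_mean(baseline_samples)
for g in lca:
    d_finger = lca[g][0] - base[g][0]   # > 0: LCA articulates fingers worse
    d_ident  = lca[g][1] - base[g][1]   # < 0: LCA preserves identity worse
    print(f"{g}: finger_err delta {d_finger:+.3f}, identity delta {d_ident:+.3f}")

# A consistent, statistically significant regression on any held-out group
# would falsify the claim that pretraining transfers cleanly through
# post-training.
```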

Figures

Figures reproduced from arXiv: 2604.02320 by Abhishek Kar, Anjali Thakrar, Ariyan Zarei, Carsten Stoll, Chen Cao, Chengan He, Christian Häne, Egor Zakharov, Giljoo Nam, Ginés Hidalgo, He Wen, James Booth, Jason Saragih, Jean-Charles Bazin, Jihyun Lee, Julieta Martinez, Junxuan Li, Kai Li, Lingchen Yang, Lucy Wang, Marco Pesavento, Matthew Hu, Peihong Guo, Qingyang Tan, Rawal Khirodkar, Rinat Abdrashitov, Sairanjith Thalanki, Shunsuke Saito, Sofien Bouaziz, Takaaki Shiratori, Teng Deng, Wyatt Borsos, Xiaowen Ma, Xuhua Huang, Yaser Sheikh, Yichen Xu, Yuan Dong, Yu Rong, Zhaoen Su, Zhongshi Jiang.

Figure 1: Large-scale Codec Avatars (LCA). (Left) We generate avatars from a handful of images in seconds. LCA follows the pre/post-training paradigm, achieving broad generalization with high-fidelity reconstruction over pretraining alone. (Middle) The resulting avatars are highly detailed with faithful 3D structures, fully animatable with expression, gaze, and body pose, even for out-of-domain samples. (Right) LCA …
Figure 2: (Left) Overview. Given multiple images of a subject, we extract image tokens from full-body images and face crops, and geometric tokens from a template mesh. The LCA encoder alternates image-only, geometry-only, and multimodal attention to fuse information across streams. Our decoders, canonical and pose-dependent, predict Gaussian attributes, which are skinned via linear blend skinning (LBS) and rendered …
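The caption's encoder description suggests a simple alternating-attention structure. Below is a minimal sketch under that reading; the dimensions, depth (one block of each kind), and exact fusion order are assumptions, not the paper's architecture.

```python
# Hypothetical sketch of the alternating attention pattern described in
# the Figure 2 caption: image-only, geometry-only, then multimodal fusion.
import torch
import torch.nn as nn

class FusionEncoder(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.img_attn   = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.geo_attn   = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.joint_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_tokens, geo_tokens):
        # Image-only self-attention over full-body and face-crop tokens.
        img_tokens = img_tokens + self.img_attn(img_tokens, img_tokens, img_tokens)[0]
        # Geometry-only self-attention over template-mesh tokens.
        geo_tokens = geo_tokens + self.geo_attn(geo_tokens, geo_tokens, geo_tokens)[0]
        # Multimodal attention fuses the two streams.
        fused = torch.cat([img_tokens, geo_tokens], dim=1)
        return fused + self.joint_attn(fused, fused, fused)[0]

enc = FusionEncoder()
out = enc(torch.randn(1, 196, 128), torch.randn(1, 512, 128))  # (1, 708, 128)
```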
Figure 3: Node-Based Deformation Model. We use a flexible two-level learnable deformation model to adapt skinning-weight learning for post-training. Inspired by [14, 21, 46, 57], we introduce a two-level (coarse-to-fine) learnable deformation module. We define a set of intermediate nodes n ∈ ℝ^(N_node×3) to encode a low-dimensional deformation subspace articulated by θ through node-level skinning weights W′. These nodes drive …
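Read literally, the caption describes a coarse-to-fine chain: pose θ articulates a small set of nodes through node-level skinning weights W′, and the nodes in turn drive dense vertex deformation. A toy sketch of that chain follows, simplified to translations only (real LBS composes full rigid transforms), with all shapes and weights illustrative.

```python
# Toy, translation-only sketch of the two-level deformation chain from the
# Figure 3 caption; shapes and random weights are illustrative placeholders.
import numpy as np

n_joints, n_nodes, n_verts = 2, 4, 100
joint_deltas = np.random.randn(n_joints, 3) * 0.01  # stand-in for pose params θ
W_prime = np.random.rand(n_nodes, n_joints)         # node-level skinning weights W'
W_prime /= W_prime.sum(axis=1, keepdims=True)       # rows sum to 1, as in LBS
W_nodes = np.random.rand(n_verts, n_nodes)          # vertex-to-node weights
W_nodes /= W_nodes.sum(axis=1, keepdims=True)

nodes_delta = W_prime @ joint_deltas   # coarse: articulate nodes n ∈ R^(N_node×3)
verts_delta = W_nodes @ nodes_delta    # fine: nodes drive dense vertex offsets
verts = np.random.rand(n_verts, 3) + verts_delta   # deformed template vertices
print(verts.shape)  # (100, 3)
```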
Figure 4: Pretraining vs. Post-Training. Qualitative comparison of models trained on multiple data sources and training strategies.

                      Capture-Studio             In-the-Wild
Method                L1↓     LPIPS↓  PSNR↑      L1↓     LPIPS↓  PSNR↑
(a) Multi-view
ExAvatar [44]         0.011   0.065   23.925     0.033   0.120   18.010
UP2You [9]            0.021   0.085   19.585     0.025   0.086   19.062
MV-LHM* [50]          0.008   0.043   26.606     0.009   0.040   25.796
LCA (Ours)            0.007   0.041   27.483     0.008   0.029   27.802
(b) …
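The table's columns are standard image-reconstruction metrics. A minimal sketch of how L1, PSNR, and LPIPS (the perceptual metric of [74]) are commonly computed; the paper's exact evaluation protocol (crops, resolution, color space) is not specified here, and the random tensors stand in for real renders.

```python
# Common computation of the reported metrics on a rendered/ground-truth pair.
import torch
import lpips  # pip install lpips; implements the perceptual metric of [74]

pred = torch.rand(1, 3, 256, 256)   # rendered avatar, values in [0, 1]
gt   = torch.rand(1, 3, 256, 256)   # ground-truth frame, values in [0, 1]

l1 = (pred - gt).abs().mean().item()
psnr = (10 * torch.log10(1.0 / ((pred - gt) ** 2).mean())).item()
loss_fn = lpips.LPIPS(net="alex")
lp = loss_fn(pred * 2 - 1, gt * 2 - 1).item()   # LPIPS expects [-1, 1] inputs
print(f"L1={l1:.3f}  LPIPS={lp:.3f}  PSNR={psnr:.3f} dB")
```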
Figure 5: Qualitative comparisons. LCA avatars produce faithful geometry and appearance, whereas other methods often exhibit 3D distortions, especially around the nose and hips. Our avatar also preserves fine appearance details such as shoelaces, benefiting from high-resolution supervision. Despite never being post-trained on examples with eyewear or headwear, LCA generalizes to these, as well as to divers…
Figure 8: Relightable LCA. We demonstrate relighting under HDRI environment maps and point lights, alongside the recovered intrinsic properties of the avatar. In contrast, the modified model more faithfully deforms loose garments without splitting inside the garments even for unseen identities casually captured with a mobile phone. Relighting. We post-train our relightable LCA extension on multi-view captures with tim…
Figure 7: Loose Garment Support (LGS). (Left) Frontal view of the input condition. (Middle) Post-trained LCA avatar without loose garment support; while the general shape is recovered, skirts behave like pants when moving. (Right) LCA with loose garment support produces plausible animations without splitting garments.
read the original abstract

High-quality 3D avatar modeling faces a critical trade-off between fidelity and generalization. On the one hand, multi-view studio data enables high-fidelity modeling of humans with precise control over expressions and poses, but it struggles to generalize to real-world data due to limited scale and the domain gap between the studio environment and the real world. On the other hand, recent large-scale avatar models trained on millions of in-the-wild samples show promise for generalization across a wide range of identities, yet the resulting avatars are often of low-quality due to inherent 3D ambiguities. To address this, we present Large-Scale Codec Avatars (LCA), a high-fidelity, full-body 3D avatar model that generalizes to world-scale populations in a feedforward manner, enabling efficient inference. Inspired by the success of large language models and vision foundation models, we present, for the first time, a pre/post-training paradigm for 3D avatar modeling at scale: we pretrain on 1M in-the-wild videos to learn broad priors over appearance and geometry, then post-train on high-quality curated data to enhance expressivity and fidelity. LCA generalizes across hair styles, clothing, and demographics while providing precise, fine-grained facial expressions and finger-level articulation control, with strong identity preservation. Notably, we observe emergent generalization to relightability and loose garment support to unconstrained inputs, and zero-shot robustness to stylized imagery, despite the absence of direct supervision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Large-Scale Codec Avatars (LCA), a high-fidelity full-body 3D avatar model that employs a pre/post-training paradigm: pretraining on 1M in-the-wild videos to acquire broad priors on appearance and geometry, followed by post-training on high-quality curated data to boost expressivity and fidelity. It claims generalization across hair styles, clothing, and demographics, precise facial expressions and finger-level control, strong identity preservation, and emergent properties such as relightability, loose garment support, and zero-shot robustness to stylized imagery without direct supervision.

Significance. If validated with quantitative evidence, this pre/post-training approach could be significant for 3D avatar modeling by bridging the scale of in-the-wild data with studio fidelity, potentially enabling more generalizable feedforward avatar systems for real-world applications.

major comments (2)
  1. [Abstract] The claims of precise fine-grained facial expressions, finger-level articulation control, and emergent generalization (relightability, loose garments, stylized robustness) are stated qualitatively, with no quantitative metrics, baselines, ablation studies, or error analysis to isolate the pretraining contribution or measure domain-gap closure.
  2. [Training paradigm description] The pre/post-training procedure is outlined at a high level with no details on stage integration (weight freezing, learning-rate schedules, regularization to prevent forgetting, or loss weighting), which is load-bearing for the central claim that broad priors from 1M in-the-wild videos transfer cleanly without residual geometry or appearance inconsistencies.
minor comments (1)
  1. [Abstract] The phrase 'world-scale populations' is imprecise; specifying the demographic coverage or diversity statistics of the 1M-video pretraining set would strengthen the generalization claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for highlighting areas where additional evidence and detail would strengthen the central claims. We will revise the manuscript to incorporate quantitative metrics, ablations, and expanded training details as outlined below.

read point-by-point responses
  1. Referee: [Abstract] The claims of precise fine-grained facial expressions, finger-level articulation control, and emergent generalization (relightability, loose garments, stylized robustness) are stated qualitatively, with no quantitative metrics, baselines, ablation studies, or error analysis to isolate the pretraining contribution or measure domain-gap closure.

    Authors: We agree that the abstract is primarily qualitative and that stronger quantitative support is needed to substantiate the claims and isolate the pretraining effect. The full manuscript contains qualitative demonstrations and some comparative results, but lacks dedicated ablations and error analysis for domain-gap closure. In the revision we will add quantitative metrics (e.g., expression and pose reconstruction errors, identity preservation scores), baselines against prior avatar models, and ablation studies that measure the contribution of the 1M-video pretraining stage, including analysis of residual inconsistencies. revision: yes

  2. Referee: [Training paradigm description] The pre/post-training procedure is outlined at a high level with no details on stage integration (weight freezing, learning-rate schedules, regularization to prevent forgetting, or loss weighting), which is load-bearing for the central claim that broad priors from 1M in-the-wild videos transfer cleanly without residual geometry or appearance inconsistencies.

    Authors: We acknowledge that the current description of the pre/post-training pipeline is high-level and does not specify the integration mechanics. In the revised manuscript we will expand the methods section with concrete details on weight freezing (or lack thereof), learning-rate schedules across stages, regularization terms used to mitigate forgetting, and the loss-weighting scheme between pretraining and post-training objectives. These additions will directly address how the broad priors are preserved without introducing geometry or appearance inconsistencies. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical pre/post-training pipeline is self-contained

full rationale

The paper presents a standard two-stage neural network training procedure—pretrain a codec avatar model on 1M in-the-wild videos, then post-train on curated high-quality captures—without any equations, uniqueness theorems, or fitted parameters that reduce the reported generalization or fidelity claims to quantities defined inside the paper itself. All performance assertions rest on external experimental outcomes (reconstruction metrics, qualitative results on held-out data) rather than algebraic identities or self-referential definitions. No load-bearing self-citations, ansatzes smuggled via prior work, or renamings of known results appear in the provided text; the method is therefore independently falsifiable by re-running the training pipeline on the stated data splits.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical hypothesis that large-scale in-the-wild pretraining supplies transferable priors; this is not derived from first principles but asserted as the basis for the observed generalization.

free parameters (1)
  • Pretraining dataset size
    1M videos chosen to learn broad priors; exact selection criteria and filtering not specified in abstract.
axioms (1)
  • domain assumption: Pretraining on in-the-wild videos learns broad priors over appearance and geometry that transfer to high-fidelity post-training
    Explicitly stated as the motivation for the two-stage approach.

pith-pipeline@v0.9.0 · 5740 in / 1297 out tokens · 37935 ms · 2026-05-13T21:29:47.631794+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. GenLCA: 3D Diffusion for Full-Body Avatars from In-the-Wild Videos

cs.CV · 2026-04 · unverdicted · novelty 7.0

    GenLCA enables scalable training of a 3D diffusion model for photorealistic, animatable full-body avatars by tokenizing large-scale real-world videos with a pretrained reconstructor and applying visibility-aware diffu...

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · cited by 1 Pith paper · 11 internal anchors

  1. [1] GPT-4 Technical Report
      Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

  2. [2] The Digital Emily Project: Achieving a Photorealistic Digital Actor
      Oleg Alexander, Mike Rogers, William Lambeth, Jen-Yuan Chiang, Wan-Chun Ma, Chuan-Chang Wang, and Paul Debevec. The digital Emily project: Achieving a photorealistic digital actor. IEEE Computer Graphics and Applications, 30(4):20–31, 2010.

  3. [3] Driving-Signal Aware Full-Body Avatars
      Timur Bagautdinov, Chenglei Wu, Tomas Simon, Fabian Prada, Takaaki Shiratori, Shih-En Wei, Weipeng Xu, Yaser Sheikh, and Jason Saragih. Driving-signal aware full-body avatars. ACM Transactions on Graphics (TOG), 40(4):1–17, 2021.

  4. [4] Lumiere: A Space-Time Diffusion Model for Video Generation
      Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space-time diffusion model for video generation. In SIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024.

  5. [5] Deep Relightable Appearance Models for Animatable Faces
      Sai Bi, Stephen Lombardi, Shunsuke Saito, Tomas Simon, Shih-En Wei, Kevyn McPhail, Ravi Ramamoorthi, Yaser Sheikh, and Jason Saragih. Deep relightable appearance models for animatable faces. ACM Transactions on Graphics (ToG), 40(4):1–15, 2021.

  6. [6] Perception Encoder: The Best Visual Embeddings Are Not at the Output of the Network
      Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, et al. Perception encoder: The best visual embeddings are not at the output of the network. arXiv preprint arXiv:2504.13181, 2025.

  7. [7] Language Models Are Few-Shot Learners
      Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

  8. [8] Genie: Generative Interactive Environments
      Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning, 2024.

  9. [9] UP2You: Fast Reconstruction of Yourself from Unconstrained Photo Collections
      Zeyu Cai, Ziyang Li, Xiaoben Li, Boqian Li, Zeyu Wang, Zhenyu Zhang, and Yuliang Xiu. UP2You: Fast reconstruction of yourself from unconstrained photo collections. arXiv preprint arXiv:2509.24817, 2025.

  10. [10] Authentic Volumetric Avatars from a Phone Scan
      Chen Cao, Tomas Simon, Jin Kyu Kim, Gabe Schwartz, Michael Zollhöfer, Shunsuke Saito, Stephen Lombardi, Shih-En Wei, Danielle Belko, Shoou-I Yu, et al. Authentic volumetric avatars from a phone scan. ACM Transactions on Graphics (TOG), 41(4):1–19, 2022.

  11. [11] PERSE: Personalized 3D Generative Avatars from a Single Portrait
      Hyunsoo Cha, Inhee Lee, and Hanbyul Joo. PERSE: Personalized 3D generative avatars from a single portrait. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15953–15962, 2025.

  12. [12] MeshAvatar: Learning High-Quality Triangular Human Avatars from Multi-View Videos
      Yushuo Chen, Zerong Zheng, Zhe Li, Chao Xu, and Yebin Liu. MeshAvatar: Learning high-quality triangular human avatars from multi-view videos. In European Conference on Computer Vision, pages 250–269. Springer, 2024.

  13. [13] Wan-Animate: Unified Character Animation and Replacement with Holistic Replication
      Gang Cheng, Xin Gao, Li Hu, Siqi Hu, Mingyang Huang, Chaonan Ji, Ju Li, Dechao Meng, Jinwei Qi, Penchong Qiao, et al. Wan-animate: Unified character animation and replacement with holistic replication. arXiv preprint arXiv:2509.14055, 2025.

  14. [14] Inverse Kinematics for Reduced Deformable Models
      Kevin G Der, Robert W Sumner, and Jovan Popović. Inverse kinematics for reduced deformable models. ACM Transactions on Graphics (TOG), 25(3):1174–1179, 2006.

  15. [15] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
      Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

  16. [16] Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
      Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.

  17. [17] A Formal Evaluation of PSNR as Quality Measurement Parameter for Image Segmentation Algorithms
      Fernando A Fardo, Victor H Conforto, Francisco C De Oliveira, and Paulo S Rodrigues. A formal evaluation of PSNR as quality measurement parameter for image segmentation algorithms. arXiv preprint arXiv:1605.07116, 2016.

  18. [18] MHR: Momentum Human Rig
      Aaron Ferguson, Ahmed AA Osman, Berta Bescos, Carsten Stoll, Chris Twigg, Christoph Lassner, David Otte, Eric Vignola, Fabian Prada, Federica Bogo, et al. MHR: Momentum human rig. arXiv preprint arXiv:2511.15586, 2025.

  19. [19] Dynamic Neural Radiance Fields for Monocular 4D Facial Avatar Reconstruction
      Guy Gafni, Justus Thies, Michael Zollhöfer, and Matthias Nießner. Dynamic neural radiance fields for monocular 4D facial avatar reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8649–8658, 2021.

  20. [20] Vid2Avatar: 3D Avatar Reconstruction from Videos in the Wild via Self-Supervised Scene Decomposition
      Chen Guo, Tianjian Jiang, Xu Chen, Jie Song, and Otmar Hilliges. Vid2Avatar: 3D avatar reconstruction from videos in the wild via self-supervised scene decomposition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12858–12868, 2023.

  21. [21] ReLoo: Reconstructing Humans Dressed in Loose Garments from Monocular Video in the Wild
      Chen Guo, Tianjian Jiang, Manuel Kaufmann, Chengwei Zheng, Julien Valentin, Jie Song, and Otmar Hilliges. ReLoo: Reconstructing humans dressed in loose garments from monocular video in the wild. In European Conference on Computer Vision, pages 21–38. Springer, 2024.

  22. [22] Vid2Avatar-Pro: Authentic Avatar from Videos in the Wild via Universal Prior
      Chen Guo, Junxuan Li, Yash Kant, Yaser Sheikh, Shunsuke Saito, and Chen Cao. Vid2Avatar-Pro: Authentic avatar from videos in the wild via universal prior. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5559–5570, 2025.

  23. [23] The Relightables: Volumetric Performance Capture of Humans with Realistic Relighting
      Kaiwen Guo, Peter Lincoln, Philip Davidson, Jay Busch, Xueming Yu, Matt Whalen, Geoff Harvey, Sergio Orts-Escolano, Rohit Pandey, Jason Dourgarian, et al. The relightables: Volumetric performance capture of humans with realistic relighting. ACM Transactions on Graphics (ToG), 38(6):1–19, 2019.

  24. [24] DiffRelight: Diffusion-Based Facial Performance Relighting
      Mingming He, Pascal Clausen, Ahmet Levent Taşel, Li Ma, Oliver Pilarski, Wenqi Xian, Laszlo Rikker, Xueming Yu, Ryan Burgert, Ning Yu, et al. DiffRelight: Diffusion-based facial performance relighting. In SIGGRAPH Asia 2024 Conference Papers, pages 1–12, 2024.

  25. [25] LAM: Large Avatar Model for One-Shot Animatable Gaussian Head
      Yisheng He, Xiaodong Gu, Xiaodan Ye, Chao Xu, Zhengyi Zhao, Yuan Dong, Weihao Yuan, Zilong Dong, and Liefeng Bo. LAM: Large avatar model for one-shot animatable Gaussian head. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Papers, pages 1–13, 2025.

  26. [26] Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation
      Li Hu. Animate anyone: Consistent and controllable image-to-video synthesis for character animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8153–8163, 2024.

  27. [27] Mistral 7B
      Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.

  28. [28] InstantAvatar: Learning Avatars from Monocular Video in 60 Seconds
      Tianjian Jiang, Xu Chen, Jie Song, and Otmar Hilliges. InstantAvatar: Learning avatars from monocular video in 60 seconds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16922–16932, 2023.

  29. [29] NeuMan: Neural Human Radiance Field from a Single Video
      Wei Jiang, Kwang Moo Yi, Golnoosh Samei, Oncel Tuzel, and Anurag Ranjan. NeuMan: Neural human radiance field from a single video. In European Conference on Computer Vision, pages 402–418. Springer, 2022.

  30. [30] Robust Dual Gaussian Splatting for Immersive Human-Centric Volumetric Videos
      Yuheng Jiang, Zhehao Shen, Yu Hong, Chengcheng Guo, Yize Wu, Yingliang Zhang, Jingyi Yu, and Lan Xu. Robust dual Gaussian splatting for immersive human-centric volumetric videos. ACM Transactions on Graphics (TOG), 43(6):1–15, 2024.

  31. [31] 3D Gaussian Splatting for Real-Time Radiance Field Rendering
      Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4):1–14, 2023.

  32. [32] Sapiens: Foundation for Human Vision Models
      Rawal Khirodkar, Timur Bagautdinov, Julieta Martinez, Su Zhaoen, Austin James, Peter Selednik, Stuart Anderson, and Shunsuke Saito. Sapiens: Foundation for human vision models. In European Conference on Computer Vision, pages 206–228. Springer, 2024.

  33. [33] PersonaBooth: Personalized Text-to-Motion Generation
      Boeun Kim, Hea In Jeong, JungHoon Sung, Yihua Cheng, Jeongmin Lee, Ju Yong Chang, Sang-Il Choi, Younggeun Choi, Saim Shin, Jungho Kim, et al. PersonaBooth: Personalized text-to-motion generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22756–22765, 2025.

  34. [34] HunyuanVideo: A Systematic Framework for Large Video Generative Models
      Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.

  35. [35] URAvatar: Universal Relightable Gaussian Codec Avatars
      Junxuan Li, Chen Cao, Gabriel Schwartz, Rawal Khirodkar, Christian Richardt, Tomas Simon, Yaser Sheikh, and Shunsuke Saito. URAvatar: Universal relightable Gaussian codec avatars. In SIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024.

  36. [36] TAVA: Template-Free Animatable Volumetric Actors
      Ruilong Li, Julian Tanke, Minh Vo, Michael Zollhöfer, Jürgen Gall, Angjoo Kanazawa, and Christoph Lassner. TAVA: Template-free animatable volumetric actors. In European Conference on Computer Vision, pages 419–436. Springer, 2022.

  37. [37] Animatable Gaussians: Learning Pose-Dependent Gaussian Maps for High-Fidelity Human Avatar Modeling
      Zhe Li, Zerong Zheng, Lizhen Wang, and Yebin Liu. Animatable Gaussians: Learning pose-dependent Gaussian maps for high-fidelity human avatar modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19711–19722, 2024.

  38. [38] Learning Implicit Templates for Point-Based Clothed Human Modeling
      Siyou Lin, Hongwen Zhang, Zerong Zheng, Ruizhi Shao, and Yebin Liu. Learning implicit templates for point-based clothed human modeling. In European Conference on Computer Vision, pages 210–228. Springer, 2022.

  39. [39] LUCAS: Layered Universal Codec Avatars
      Di Liu, Teng Deng, Giljoo Nam, Yu Rong, Stanislav Pidhorskyi, Junxuan Li, Jason Saragih, Dimitris N Metaxas, and Chen Cao. LUCAS: Layered universal codec avatars. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21127–21137, 2025.

  40. [40] Decoupled Weight Decay Regularization
      Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

  41. [41] Pixel Codec Avatars
      Shugao Ma, Tomas Simon, Jason Saragih, Dawei Wang, Yuecheng Li, Fernando De La Torre, and Yaser Sheikh. Pixel codec avatars. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 64–73, 2021.

  42. [42] Codec Avatar Studio: Paired Human Captures for Complete, Driveable, and Generalizable Avatars
      Julieta Martinez, Emily Kim, Javier Romero, Timur Bagautdinov, Shunsuke Saito, Shoou-I Yu, Stuart Anderson, Michael Zollhöfer, Te-Li Wang, Shaojie Bai, et al. Codec avatar studio: Paired human captures for complete, driveable, and generalizable avatars. Advances in Neural Information Processing Systems, 37:83008–83023, 2024.

  43. [43] NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
      Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.

  44. [44] Expressive Whole-Body 3D Gaussian Avatar
      Gyeongsik Moon, Takaaki Shiratori, and Shunsuke Saito. Expressive whole-body 3D Gaussian avatar. In European Conference on Computer Vision, pages 19–35. Springer, 2024.

  45. [45] Training Language Models to Follow Instructions with Human Feedback
      Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.

  46. [46] Predicting Loose-Fitting Garment Deformations Using Bone-Driven Motion Networks
      Xiaoyu Pan, Jiaming Mai, Xinwei Jiang, Dongxue Tang, Jingxiang Li, Tianjia Shao, Kun Zhou, Xiaogang Jin, and Dinesh Manocha. Predicting loose-fitting garment deformations using bone-driven motion networks. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–10, 2022.

  47. [47] ATLAS: Decoupling Skeletal and Shape Parameters for Expressive Parametric Human Modeling
      Jinhyung Park, Javier Romero, Shunsuke Saito, Fabian Prada, Takaaki Shiratori, Yichen Xu, Federica Bogo, Shoou-I Yu, Kris Kitani, and Rawal Khirodkar. ATLAS: Decoupling skeletal and shape parameters for expressive parametric human modeling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6508–6518, 2025.

  48. [48] Animatable Neural Radiance Fields for Modeling Dynamic Human Bodies
      Sida Peng, Junting Dong, Qianqian Wang, Shangzhan Zhang, Qing Shuai, Xiaowei Zhou, and Hujun Bao. Animatable neural radiance fields for modeling dynamic human bodies. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14314–14323, 2021.

  49. [49] GaussianAvatars: Photorealistic Head Avatars with Rigged 3D Gaussians
      Shenhan Qian, Tobias Kirschstein, Liam Schoneveld, Davide Davoli, Simon Giebenhain, and Matthias Nießner. GaussianAvatars: Photorealistic head avatars with rigged 3D Gaussians. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20299–20309, 2024.

  50. [50] LHM: Large Animatable Human Reconstruction Model from a Single Image in Seconds
      Lingteng Qiu, Xiaodong Gu, Peihao Li, Qi Zuo, Weichao Shen, Junfei Zhang, Kejie Qiu, Weihao Yuan, Guanying Chen, Zilong Dong, and Liefeng Bo. LHM: Large animatable human reconstruction model from a single image in seconds. CoRR, abs/2503.10625, 2025.

  51. [51] PF-LHM: 3D Animatable Avatar Reconstruction from Pose-Free Articulated Human Images
      Lingteng Qiu, Peihao Li, Qi Zuo, Xiaodong Gu, Yuan Dong, Weihao Yuan, Siyu Zhu, Xiaoguang Han, Guanying Chen, and Zilong Dong. PF-LHM: 3D animatable avatar reconstruction from pose-free articulated human images. arXiv preprint arXiv:2506.13766, 2025.

  52. [52] Direct Preference Optimization: Your Language Model Is Secretly a Reward Model
      Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023.

  53. [53] Relightable Gaussian Codec Avatars
      Shunsuke Saito, Gabriel Schwartz, Tomas Simon, Junxuan Li, and Giljoo Nam. Relightable Gaussian codec avatars. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 130–141, 2024.

  54. [54] Exploring Multimodal Diffusion Transformers for Enhanced Prompt-Based Image Editing
      Joonghyuk Shin, Alchan Hwang, Yujin Kim, Daneul Kim, and Jaesik Park. Exploring multimodal diffusion transformers for enhanced prompt-based image editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19492–19502, 2025.

  55. [55] Persona: Personalized Whole-Body 3D Avatar with Pose-Driven Deformations from a Single Image
      Geonhee Sim and Gyeongsik Moon. Persona: Personalized whole-body 3D avatar with pose-driven deformations from a single image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12670–12680, 2025.

  56. [56] DINOv3
      Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025.

  57. [57] Embedded Deformation for Shape Manipulation
      Robert W Sumner, Johannes Schmid, and Mark Pauly. Embedded deformation for shape manipulation. In ACM SIGGRAPH 2007 Papers, pages 80–es, 2007.

  58. [58] Light Stage Super-Resolution: Continuous High-Frequency Relighting
      Tiancheng Sun, Zexiang Xu, Xiuming Zhang, Sean Fanello, Christoph Rhemann, Paul Debevec, Yun-Ta Tsai, Jonathan T Barron, and Ravi Ramamoorthi. Light stage super-resolution: Continuous high-frequency relighting. ACM Transactions on Graphics (TOG), 39(6):1–12, 2020.

  59. [59] Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers
      Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, and Donald Metzler. Scale efficiently: Insights from pre-training and fine-tuning transformers. arXiv preprint arXiv:2109.10686, 2021.

  60. [60] Gemini: A Family of Highly Capable Multimodal Models
      Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

  61. [61] LLaMA: Open and Efficient Foundation Language Models
      Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

  62. [62] Llama 2: Open Foundation and Fine-Tuned Chat Models
      Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

  63. [63] Wan: Open and Advanced Large-Scale Video Generative Models
      Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

  64. [64] VGGT: Visual Geometry Grounded Transformer
      Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025.

  65. [65] A Survey on 3D Human Avatar Modeling – From Reconstruction to Generation
      Ruihe Wang, Yukang Cao, Kai Han, and Kwan-Yee K Wong. A survey on 3D human avatar modeling – from reconstruction to generation. arXiv preprint arXiv:2406.04253, 2024.

  66. [66] Relightable Full-Body Gaussian Codec Avatars
      Shaofei Wang, Tomas Simon, Igor Santesteban, Timur Bagautdinov, Junxuan Li, Vasu Agrawal, Fabian Prada, Shoou-I Yu, Pace Nalbone, Matt Gramlich, et al. Relightable full-body Gaussian codec avatars. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Papers, pages 1–12, 2025.

  67. [67] Template-Free Single-View 3D Human Digitalization with Diffusion-Guided LRM
      Zhenzhen Weng, Jingyuan Liu, Hao Tan, Zhan Xu, Yang Zhou, Serena Yeung-Levy, and Jimei Yang. Template-free single-view 3D human digitalization with diffusion-guided LRM. arXiv preprint arXiv:2401.12175, 2024.

  68. [68] Understanding and Improving Layer Normalization
      Jingjing Xu, Xu Sun, Zhiyuan Zhang, Guangxiang Zhao, and Junyang Lin. Understanding and improving layer normalization. Advances in Neural Information Processing Systems, 32, 2019.

  69. [69] VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time
      Sicheng Xu, Guojun Chen, Yu-Xiao Guo, Jiaolong Yang, Chong Li, Zhenyu Zang, Yizhong Zhang, Xin Tong, and Baining Guo. VASA-1: Lifelike audio-driven talking faces generated in real time. Advances in Neural Information Processing Systems, 37:660–684, 2024.

  70. [70] Gaussian Head Avatar: Ultra High-Fidelity Head Avatar via Dynamic Gaussians
      Yuelang Xu, Benwang Chen, Zhe Li, Hongwen Zhang, Lizhen Wang, Zerong Zheng, and Yebin Liu. Gaussian head avatar: Ultra high-fidelity head avatar via dynamic Gaussians. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1931–1941, 2024.

  71. [71] Towards Practical Capture of High-Fidelity Relightable Avatars
      Haotian Yang, Mingwu Zheng, Wanquan Feng, Haibin Huang, Yu-Kun Lai, Pengfei Wan, Zhongyuan Wang, and Chongyang Ma. Towards practical capture of high-fidelity relightable avatars. In SIGGRAPH Asia 2023 Conference Papers, pages 1–11, 2023.

  72. [72] Avatars for Teleconsultation: Effects of Avatar Embodiment Techniques on User Perception in 3D Asymmetric Telepresence
      Kevin Yu, Gleb Gorbachev, Ulrich Eck, Frieder Pankratz, Nassir Navab, and Daniel Roth. Avatars for teleconsultation: Effects of avatar embodiment techniques on user perception in 3D asymmetric telepresence. IEEE Transactions on Visualization and Computer Graphics, 27(11):4129–4139, 2021.

  73. [73] GUAVA: Generalizable Upper Body 3D Gaussian Avatar
      Dongbin Zhang, Yunfei Liu, Lijian Lin, Ye Zhu, Yang Li, Minghan Qin, Yu Li, and Haoqian Wang. GUAVA: Generalizable upper body 3D Gaussian avatar. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14205–14217, 2025.

  74. [74] The Unreasonable Effectiveness of Deep Features as a Perceptual Metric
      Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.

  75. [75] AvatarReX: Real-Time Expressive Full-Body Avatars
      Zerong Zheng, Xiaochen Zhao, Hongwen Zhang, Boning Liu, and Yebin Liu. AvatarReX: Real-time expressive full-body avatars. ACM Transactions on Graphics (TOG), 42(4):1–19, 2023.

  76. [76] IDOL: Instant Photorealistic 3D Human Creation from a Single Image
      Yiyu Zhuang, Jiaxi Lv, Hao Wen, Qing Shuai, Ailing Zeng, Hao Zhu, Shifeng Chen, Yujiu Yang, Xun Cao, and Wei Liu. IDOL: Instant photorealistic 3D human creation from a single image. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26308–26319, 2025.

  77. [77] Drivable 3D Gaussian Avatars
      Wojciech Zielonka, Timur Bagautdinov, Shunsuke Saito, Michael Zollhöfer, Justus Thies, and Javier Romero. Drivable 3D Gaussian avatars. In 2025 International Conference on 3D Vision (3DV), pages 979–990. IEEE, 2025.