pith. machine review for the scientific record.

arxiv: 2604.02320 · v2 · submitted 2026-04-02 · 💻 cs.CV · cs.GR

Recognition: no theorem link

Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 21:29 UTC · model grok-4.3

classification 💻 cs.CV cs.GR
keywords 3D avatar modeling · large-scale pretraining · codec avatars · full-body reconstruction · facial animation · generalization · in-the-wild data · post-training

The pith

Pretraining on one million in-the-wild videos followed by post-training on curated data produces high-fidelity full-body 3D avatars that generalize across identities and styles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a pre- and post-training approach for 3D avatar modeling that first builds broad priors from a million everyday videos and then refines them on high-quality data. This addresses the long-standing tension between studio-level fidelity, which usually fails outside controlled settings, and large-scale generalization, which has historically produced low-quality results. The resulting Large-scale Codec Avatars deliver feedforward inference with precise expression and finger control while preserving identity across varied hair, clothing, and demographics. A sympathetic reader would care because the method appears to unlock real-world usability without sacrificing the fine details needed for convincing animation.

Core claim

The authors demonstrate that pretraining a 3D avatar model on 1M in-the-wild videos yields general priors over appearance and geometry; subsequent post-training on high-quality curated data then improves expressivity and fidelity. The resulting LCA model generalizes across hair styles, clothing, and demographics, supplies precise facial expressions and finger-level articulation, and maintains strong identity preservation. It also exhibits emergent capabilities such as relightability and loose-garment support on unconstrained inputs, together with zero-shot robustness to stylized imagery, all without direct supervision for those properties.

What carries the argument

The pre/post-training paradigm for Large-scale Codec Avatars (LCA), in which pretraining on 1M in-the-wild videos supplies broad appearance and geometry priors that post-training on high-quality data then sharpens for expressivity and control.
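As a concrete reading of that paradigm, the sketch below shows the shape of a two-stage recipe. The model class, loss, learning rates, and data stand-ins are all hypothetical; the paper does not disclose these choices, so this illustrates the structure of pretrain-then-post-train, not the authors' implementation.

```python
# Hypothetical two-stage (pretrain / post-train) recipe; all names and
# hyperparameters are illustrative assumptions, not taken from the paper.
import torch
import torch.nn as nn

class AvatarModel(nn.Module):
    """Toy stand-in for the LCA encoder/decoder stack."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 3))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Placeholder output: per-token RGB standing in for Gaussian attributes.
        return self.net(tokens)

def run_stage(model: nn.Module, batches, lr: float) -> None:
    """One training stage: photometric L1 against target renders."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for tokens, target in batches:
        loss = (model(tokens) - target).abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

model = AvatarModel()

# Stage 1: pretrain on large-scale in-the-wild batches (broad but noisy).
pretrain = [(torch.randn(8, 64), torch.rand(8, 3)) for _ in range(4)]
run_stage(model, pretrain, lr=1e-4)

# Stage 2: post-train on curated high-quality captures, here at a lower
# learning rate so broad priors are refined rather than overwritten
# (the rate relationship is an assumption; the paper does not report it).
curated = [(torch.randn(8, 64), torch.rand(8, 3)) for _ in range(4)]
run_stage(model, curated, lr=1e-5)
```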

If this is right

  • Avatars achieve precise fine-grained control over facial expressions and finger articulations.
  • The model generalizes across hair styles, clothing, and demographic groups while preserving identity.
  • Emergent support appears for relighting and loose garments on unconstrained inputs.
  • Zero-shot robustness to stylized imagery occurs without explicit training for that case.
  • Full-body avatars can be generated in a single feedforward pass at inference time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The two-stage recipe suggests that data-scale benefits observed in language and 2D vision models may extend to 3D human modeling.
  • Casual video capture could become sufficient for creating personalized avatars, reducing the need for per-person studio sessions.
  • The feedforward design opens straightforward integration into real-time virtual environments that must handle varied real-world lighting and clothing.

Load-bearing premise

Pretraining on 1M in-the-wild videos produces priors that transfer cleanly to high-fidelity post-training without residual domain-gap artifacts or loss of fine control.

What would settle it

A controlled comparison in which post-trained avatars show measurable drops in finger articulation accuracy or identity preservation on held-out demographic groups relative to a high-quality-only baseline would falsify the transfer claim.
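A hedged sketch of what that settling experiment could look like as an evaluation script. The group labels, metric values, and sample data are placeholders; only the comparison structure, per-group deltas against a high-quality-only baseline, follows from the text above.

```python
# Hypothetical per-group comparison of the pretrained+post-trained model
# against a curated-data-only baseline; all numbers are placeholders.
from collections import defaultdict
from statistics import mean

def per_group_mean(samples):
    """samples: list of (group, finger_err, identity_sim) tuples."""
    groups = defaultdict(list)
    for group, finger_err, identity_sim in samples:
        groups[group].append((finger_err, identity_sim))
    return {g: (mean(x[0] for x in xs), mean(x[1] for x in xs))
            for g, xs in groups.items()}

# Placeholder measurements on a held-out set (would come from real evals).
lca_samples      = [("group_a", 0.021, 0.91), ("group_b", 0.024, 0.90)]
baseline_samples = [("group_a", 0.020, 0.92), ("group_b", 0.019, 0.93)]

lca, base = per_group_mean(lca_samples), per_group_mean(baseline_samples)
for g in lca:
    d_finger = lca[g][0] - base[g][0]   # > 0: LCA articulates fingers worse
    d_ident  = lca[g][1] - base[g][1]   # < 0: LCA preserves identity worse
    print(f"{g}: finger_err delta {d_finger:+.3f}, identity delta {d_ident:+.3f}")

# A consistent, statistically significant regression on any held-out group
# would falsify the claim that pretraining transfers cleanly through
# post-training.
```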

Figures

Figures reproduced from arXiv: 2604.02320 by Abhishek Kar, Anjali Thakrar, Ariyan Zarei, Carsten Stoll, Chen Cao, Chengan He, Christian Häne, Egor Zakharov, Giljoo Nam, Ginés Hidalgo, He Wen, James Booth, Jason Saragih, Jean-Charles Bazin, Jihyun Lee, Julieta Martinez, Junxuan Li, Kai Li, Lingchen Yang, Lucy Wang, Marco Pesavento, Matthew Hu, Peihong Guo, Qingyang Tan, Rawal Khirodkar, Rinat Abdrashitov, Sairanjith Thalanki, Shunsuke Saito, Sofien Bouaziz, Takaaki Shiratori, Teng Deng, Wyatt Borsos, Xiaowen Ma, Xuhua Huang, Yaser Sheikh, Yichen Xu, Yuan Dong, Yu Rong, Zhaoen Su, Zhongshi Jiang.

Figure 1: Large-scale Codec Avatars (LCA). (Left) We generate avatars from a handful of images in seconds. LCA follows the pre/post-training paradigm, achieving broad generalization with high-fidelity reconstruction over pretraining alone. (Middle) The resulting avatars are highly detailed with faithful 3D structures, fully animatable with expression, gaze, and body pose, even for out-of-domain samples. (Right) LCA …
Figure 2: (Left) Overview. Given multiple images of a subject, we extract image tokens from full-body images and face crops, and geometric tokens from a template mesh. The LCA encoder alternates image-only, geometry-only, and multimodal attention to fuse information across streams. Our decoders, canonical and pose-dependent, predict Gaussian attributes, which are skinned via linear blend skinning (LBS) and rendered …
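The caption's encoder description suggests a simple alternating-attention structure. Below is a minimal sketch under that reading; the dimensions, depth (one block of each kind), and exact fusion order are assumptions, not the paper's architecture.

```python
# Hypothetical sketch of the alternating attention pattern described in
# the Figure 2 caption: image-only, geometry-only, then multimodal fusion.
import torch
import torch.nn as nn

class FusionEncoder(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.img_attn   = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.geo_attn   = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.joint_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_tokens, geo_tokens):
        # Image-only self-attention over full-body and face-crop tokens.
        img_tokens = img_tokens + self.img_attn(img_tokens, img_tokens, img_tokens)[0]
        # Geometry-only self-attention over template-mesh tokens.
        geo_tokens = geo_tokens + self.geo_attn(geo_tokens, geo_tokens, geo_tokens)[0]
        # Multimodal attention fuses the two streams.
        fused = torch.cat([img_tokens, geo_tokens], dim=1)
        return fused + self.joint_attn(fused, fused, fused)[0]

enc = FusionEncoder()
out = enc(torch.randn(1, 196, 128), torch.randn(1, 512, 128))  # (1, 708, 128)
```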
Figure 3: Node-Based Deformation Model. We use a flexible two-level learnable deformation model to adapt skinning-weight learning for post-training. Inspired by [14, 21, 46, 57], we introduce a two-level (coarse-to-fine) learnable deformation module. We define a set of intermediate nodes n ∈ ℝ^(N_node×3) to encode a low-dimensional deformation subspace articulated by θ through node-level skinning weights W′. These nodes drive …
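Read literally, the caption describes a coarse-to-fine chain: pose θ articulates a small set of nodes through node-level skinning weights W′, and the nodes in turn drive dense vertex deformation. A toy sketch of that chain follows, simplified to translations only (real LBS composes full rigid transforms), with all shapes and weights illustrative.

```python
# Toy, translation-only sketch of the two-level deformation chain from the
# Figure 3 caption; shapes and random weights are illustrative placeholders.
import numpy as np

n_joints, n_nodes, n_verts = 2, 4, 100
joint_deltas = np.random.randn(n_joints, 3) * 0.01  # stand-in for pose params θ
W_prime = np.random.rand(n_nodes, n_joints)         # node-level skinning weights W'
W_prime /= W_prime.sum(axis=1, keepdims=True)       # rows sum to 1, as in LBS
W_nodes = np.random.rand(n_verts, n_nodes)          # vertex-to-node weights
W_nodes /= W_nodes.sum(axis=1, keepdims=True)

nodes_delta = W_prime @ joint_deltas   # coarse: articulate nodes n ∈ R^(N_node×3)
verts_delta = W_nodes @ nodes_delta    # fine: nodes drive dense vertex offsets
verts = np.random.rand(n_verts, 3) + verts_delta   # deformed template vertices
print(verts.shape)  # (100, 3)
```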
Figure 4: Pretraining vs. Post-Training. Qualitative comparison of models trained on multiple data sources and training strategies.

                      Capture-Studio             In-the-Wild
Method                L1↓     LPIPS↓  PSNR↑      L1↓     LPIPS↓  PSNR↑
(a) Multi-view
ExAvatar [44]         0.011   0.065   23.925     0.033   0.120   18.010
UP2You [9]            0.021   0.085   19.585     0.025   0.086   19.062
MV-LHM* [50]          0.008   0.043   26.606     0.009   0.040   25.796
LCA (Ours)            0.007   0.041   27.483     0.008   0.029   27.802
(b) …
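The table's columns are standard image-reconstruction metrics. A minimal sketch of how L1, PSNR, and LPIPS (the perceptual metric of [74]) are commonly computed; the paper's exact evaluation protocol (crops, resolution, color space) is not specified here, and the random tensors stand in for real renders.

```python
# Common computation of the reported metrics on a rendered/ground-truth pair.
import torch
import lpips  # pip install lpips; implements the perceptual metric of [74]

pred = torch.rand(1, 3, 256, 256)   # rendered avatar, values in [0, 1]
gt   = torch.rand(1, 3, 256, 256)   # ground-truth frame, values in [0, 1]

l1 = (pred - gt).abs().mean().item()
psnr = (10 * torch.log10(1.0 / ((pred - gt) ** 2).mean())).item()
loss_fn = lpips.LPIPS(net="alex")
lp = loss_fn(pred * 2 - 1, gt * 2 - 1).item()   # LPIPS expects [-1, 1] inputs
print(f"L1={l1:.3f}  LPIPS={lp:.3f}  PSNR={psnr:.3f} dB")
```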
Figure 5: Qualitative comparisons. LCA avatars produce faithful geometry and appearance, whereas other methods often exhibit 3D distortions, especially around the nose and hips. Our avatar also preserves fine appearance details such as shoelaces, benefiting from high-resolution supervision. Despite never being post-trained on examples with eyewear or headwear, LCA generalizes to these, as well as to divers…
Figure 8: Relightable LCA. We demonstrate relighting under HDRI environment maps and point lights, alongside the recovered intrinsic properties of the avatar. In contrast, the modified model more faithfully deforms loose garments without splitting inside the garments even for unseen identities casually captured with a mobile phone. Relighting. We post-train our relightable LCA extension on multi-view captures with tim…
Figure 7: Loose Garment Support (LGS). (Left) Frontal view of the input condition. (Middle) Post-trained LCA avatar without loose garment support; while the general shape is recovered, skirts behave like pants when moving. (Right) LCA with loose garment support produces plausible animations without splitting garments.
read the original abstract

High-quality 3D avatar modeling faces a critical trade-off between fidelity and generalization. On the one hand, multi-view studio data enables high-fidelity modeling of humans with precise control over expressions and poses, but it struggles to generalize to real-world data due to limited scale and the domain gap between the studio environment and the real world. On the other hand, recent large-scale avatar models trained on millions of in-the-wild samples show promise for generalization across a wide range of identities, yet the resulting avatars are often of low-quality due to inherent 3D ambiguities. To address this, we present Large-Scale Codec Avatars (LCA), a high-fidelity, full-body 3D avatar model that generalizes to world-scale populations in a feedforward manner, enabling efficient inference. Inspired by the success of large language models and vision foundation models, we present, for the first time, a pre/post-training paradigm for 3D avatar modeling at scale: we pretrain on 1M in-the-wild videos to learn broad priors over appearance and geometry, then post-train on high-quality curated data to enhance expressivity and fidelity. LCA generalizes across hair styles, clothing, and demographics while providing precise, fine-grained facial expressions and finger-level articulation control, with strong identity preservation. Notably, we observe emergent generalization to relightability and loose garment support to unconstrained inputs, and zero-shot robustness to stylized imagery, despite the absence of direct supervision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Large-Scale Codec Avatars (LCA), a high-fidelity full-body 3D avatar model that employs a pre/post-training paradigm: pretraining on 1M in-the-wild videos to acquire broad priors on appearance and geometry, followed by post-training on high-quality curated data to boost expressivity and fidelity. It claims generalization across hair styles, clothing, and demographics, precise facial expressions and finger-level control, strong identity preservation, and emergent properties such as relightability, loose garment support, and zero-shot robustness to stylized imagery without direct supervision.

Significance. If validated with quantitative evidence, this pre/post-training approach could be significant for 3D avatar modeling by bridging the scale of in-the-wild data with studio fidelity, potentially enabling more generalizable feedforward avatar systems for real-world applications.

major comments (2)
  1. [Abstract] The claims of precise fine-grained facial expressions, finger-level articulation control, and emergent generalization (relightability, loose garments, stylized robustness) are stated qualitatively, with no quantitative metrics, baselines, ablation studies, or error analysis to isolate the pretraining contribution or measure domain-gap closure.
  2. [Training paradigm description] The pre/post-training procedure is outlined at a high level with no details on stage integration (weight freezing, learning-rate schedules, regularization to prevent forgetting, or loss weighting), which is load-bearing for the central claim that broad priors from 1M in-the-wild videos transfer cleanly without residual geometry or appearance inconsistencies.
minor comments (1)
  1. [Abstract] The phrase 'world-scale populations' is imprecise; specifying the demographic coverage or diversity statistics of the 1M-video pretraining set would strengthen the generalization claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for highlighting areas where additional evidence and detail would strengthen the central claims. We will revise the manuscript to incorporate quantitative metrics, ablations, and expanded training details as outlined below.

read point-by-point responses
  1. Referee: [Abstract] The claims of precise fine-grained facial expressions, finger-level articulation control, and emergent generalization (relightability, loose garments, stylized robustness) are stated qualitatively, with no quantitative metrics, baselines, ablation studies, or error analysis to isolate the pretraining contribution or measure domain-gap closure.

    Authors: We agree that the abstract is primarily qualitative and that stronger quantitative support is needed to substantiate the claims and isolate the pretraining effect. The full manuscript contains qualitative demonstrations and some comparative results, but lacks dedicated ablations and error analysis for domain-gap closure. In the revision we will add quantitative metrics (e.g., expression and pose reconstruction errors, identity preservation scores), baselines against prior avatar models, and ablation studies that measure the contribution of the 1M-video pretraining stage, including analysis of residual inconsistencies. revision: yes

  2. Referee: [Training paradigm description] The pre/post-training procedure is outlined at a high level with no details on stage integration (weight freezing, learning-rate schedules, regularization to prevent forgetting, or loss weighting), which is load-bearing for the central claim that broad priors from 1M in-the-wild videos transfer cleanly without residual geometry or appearance inconsistencies.

    Authors: We acknowledge that the current description of the pre/post-training pipeline is high-level and does not specify the integration mechanics. In the revised manuscript we will expand the methods section with concrete details on weight freezing (or lack thereof), learning-rate schedules across stages, regularization terms used to mitigate forgetting, and the loss-weighting scheme between pretraining and post-training objectives. These additions will directly address how the broad priors are preserved without introducing geometry or appearance inconsistencies. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical pre/post-training pipeline is self-contained

full rationale

The paper presents a standard two-stage neural network training procedure—pretrain a codec avatar model on 1M in-the-wild videos, then post-train on curated high-quality captures—without any equations, uniqueness theorems, or fitted parameters that reduce the reported generalization or fidelity claims to quantities defined inside the paper itself. All performance assertions rest on external experimental outcomes (reconstruction metrics, qualitative results on held-out data) rather than algebraic identities or self-referential definitions. No load-bearing self-citations, ansatzes smuggled via prior work, or renamings of known results appear in the provided text; the method is therefore independently falsifiable by re-running the training pipeline on the stated data splits.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical hypothesis that large-scale in-the-wild pretraining supplies transferable priors; this is not derived from first principles but asserted as the basis for the observed generalization.

free parameters (1)
  • Pretraining dataset size
    1M videos chosen to learn broad priors; exact selection criteria and filtering not specified in abstract.
axioms (1)
  • domain assumption: Pretraining on in-the-wild videos learns broad priors over appearance and geometry that transfer to high-fidelity post-training
    Explicitly stated as the motivation for the two-stage approach.

pith-pipeline@v0.9.0 · 5740 in / 1297 out tokens · 37935 ms · 2026-05-13T21:29:47.631794+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. GenLCA: 3D Diffusion for Full-Body Avatars from In-the-Wild Videos

cs.CV · 2026-04 · unverdicted · novelty 7.0

    GenLCA enables scalable training of a 3D diffusion model for photorealistic, animatable full-body avatars by tokenizing large-scale real-world videos with a pretrained reconstructor and applying visibility-aware diffu...

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · cited by 1 Pith paper · 11 internal anchors

  1. [1] GPT-4 Technical Report
      Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

  2. [2] The Digital Emily Project: Achieving a Photorealistic Digital Actor
      Oleg Alexander, Mike Rogers, William Lambeth, Jen-Yuan Chiang, Wan-Chun Ma, Chuan-Chang Wang, and Paul Debevec. The digital Emily project: Achieving a photorealistic digital actor. IEEE Computer Graphics and Applications, 30(4):20–31, 2010.

  3. [3] Driving-Signal Aware Full-Body Avatars
      Timur Bagautdinov, Chenglei Wu, Tomas Simon, Fabian Prada, Takaaki Shiratori, Shih-En Wei, Weipeng Xu, Yaser Sheikh, and Jason Saragih. Driving-signal aware full-body avatars. ACM Transactions on Graphics (TOG), 40(4):1–17, 2021.

  4. [4] Lumiere: A Space-Time Diffusion Model for Video Generation
      Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space-time diffusion model for video generation. In SIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024.

  5. [5] Deep Relightable Appearance Models for Animatable Faces
      Sai Bi, Stephen Lombardi, Shunsuke Saito, Tomas Simon, Shih-En Wei, Kevyn McPhail, Ravi Ramamoorthi, Yaser Sheikh, and Jason Saragih. Deep relightable appearance models for animatable faces. ACM Transactions on Graphics (ToG), 40(4):1–15, 2021.

  6. [6] Perception Encoder: The Best Visual Embeddings Are Not at the Output of the Network
      Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, et al. Perception encoder: The best visual embeddings are not at the output of the network. arXiv preprint arXiv:2504.13181, 2025.

  7. [7] Language Models Are Few-Shot Learners
      Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

  8. [8] Genie: Generative Interactive Environments
      Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning, 2024.

  9. [9] UP2You: Fast Reconstruction of Yourself from Unconstrained Photo Collections
      Zeyu Cai, Ziyang Li, Xiaoben Li, Boqian Li, Zeyu Wang, Zhenyu Zhang, and Yuliang Xiu. UP2You: Fast reconstruction of yourself from unconstrained photo collections. arXiv preprint arXiv:2509.24817, 2025.

  10. [10] Authentic Volumetric Avatars from a Phone Scan
      Chen Cao, Tomas Simon, Jin Kyu Kim, Gabe Schwartz, Michael Zollhöfer, Shunsuke Saito, Stephen Lombardi, Shih-En Wei, Danielle Belko, Shoou-I Yu, et al. Authentic volumetric avatars from a phone scan. ACM Transactions on Graphics (TOG), 41(4):1–19, 2022.

  11. [11] PERSE: Personalized 3D Generative Avatars from a Single Portrait
      Hyunsoo Cha, Inhee Lee, and Hanbyul Joo. PERSE: Personalized 3D generative avatars from a single portrait. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15953–15962, 2025.

  12. [12] MeshAvatar: Learning High-Quality Triangular Human Avatars from Multi-View Videos
      Yushuo Chen, Zerong Zheng, Zhe Li, Chao Xu, and Yebin Liu. MeshAvatar: Learning high-quality triangular human avatars from multi-view videos. In European Conference on Computer Vision, pages 250–269. Springer, 2024.

  13. [13] Wan-Animate: Unified Character Animation and Replacement with Holistic Replication
      Gang Cheng, Xin Gao, Li Hu, Siqi Hu, Mingyang Huang, Chaonan Ji, Ju Li, Dechao Meng, Jinwei Qi, Penchong Qiao, et al. Wan-animate: Unified character animation and replacement with holistic replication. arXiv preprint arXiv:2509.14055, 2025.

  14. [14] Inverse Kinematics for Reduced Deformable Models
      Kevin G Der, Robert W Sumner, and Jovan Popović. Inverse kinematics for reduced deformable models. ACM Transactions on Graphics (TOG), 25(3):1174–1179, 2006.

  15. [15] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
      Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

  16. [16] Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
      Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.

  17. [17] A Formal Evaluation of PSNR as Quality Measurement Parameter for Image Segmentation Algorithms
      Fernando A Fardo, Victor H Conforto, Francisco C De Oliveira, and Paulo S Rodrigues. A formal evaluation of PSNR as quality measurement parameter for image segmentation algorithms. arXiv preprint arXiv:1605.07116, 2016.

  18. [18] MHR: Momentum Human Rig
      Aaron Ferguson, Ahmed AA Osman, Berta Bescos, Carsten Stoll, Chris Twigg, Christoph Lassner, David Otte, Eric Vignola, Fabian Prada, Federica Bogo, et al. MHR: Momentum human rig. arXiv preprint arXiv:2511.15586, 2025.

  19. [19] Dynamic Neural Radiance Fields for Monocular 4D Facial Avatar Reconstruction
      Guy Gafni, Justus Thies, Michael Zollhöfer, and Matthias Nießner. Dynamic neural radiance fields for monocular 4D facial avatar reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8649–8658, 2021.

  20. [20] Vid2Avatar: 3D Avatar Reconstruction from Videos in the Wild via Self-Supervised Scene Decomposition
      Chen Guo, Tianjian Jiang, Xu Chen, Jie Song, and Otmar Hilliges. Vid2Avatar: 3D avatar reconstruction from videos in the wild via self-supervised scene decomposition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12858–12868, 2023.

  21. [21] ReLoo: Reconstructing Humans Dressed in Loose Garments from Monocular Video in the Wild
      Chen Guo, Tianjian Jiang, Manuel Kaufmann, Chengwei Zheng, Julien Valentin, Jie Song, and Otmar Hilliges. ReLoo: Reconstructing humans dressed in loose garments from monocular video in the wild. In European Conference on Computer Vision, pages 21–38. Springer, 2024.

  22. [22] Vid2Avatar-Pro: Authentic Avatar from Videos in the Wild via Universal Prior
      Chen Guo, Junxuan Li, Yash Kant, Yaser Sheikh, Shunsuke Saito, and Chen Cao. Vid2Avatar-Pro: Authentic avatar from videos in the wild via universal prior. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5559–5570, 2025.

  23. [23] The Relightables: Volumetric Performance Capture of Humans with Realistic Relighting
      Kaiwen Guo, Peter Lincoln, Philip Davidson, Jay Busch, Xueming Yu, Matt Whalen, Geoff Harvey, Sergio Orts-Escolano, Rohit Pandey, Jason Dourgarian, et al. The relightables: Volumetric performance capture of humans with realistic relighting. ACM Transactions on Graphics (ToG), 38(6):1–19, 2019.

  24. [24] DiffRelight: Diffusion-Based Facial Performance Relighting
      Mingming He, Pascal Clausen, Ahmet Levent Taşel, Li Ma, Oliver Pilarski, Wenqi Xian, Laszlo Rikker, Xueming Yu, Ryan Burgert, Ning Yu, et al. DiffRelight: Diffusion-based facial performance relighting. In SIGGRAPH Asia 2024 Conference Papers, pages 1–12, 2024.

  25. [25] LAM: Large Avatar Model for One-Shot Animatable Gaussian Head
      Yisheng He, Xiaodong Gu, Xiaodan Ye, Chao Xu, Zhengyi Zhao, Yuan Dong, Weihao Yuan, Zilong Dong, and Liefeng Bo. LAM: Large avatar model for one-shot animatable Gaussian head. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Papers, pages 1–13, 2025.

  26. [26] Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation
      Li Hu. Animate anyone: Consistent and controllable image-to-video synthesis for character animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8153–8163, 2024.

  27. [27] Mistral 7B
      Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.

  28. [28] InstantAvatar: Learning Avatars from Monocular Video in 60 Seconds
      Tianjian Jiang, Xu Chen, Jie Song, and Otmar Hilliges. InstantAvatar: Learning avatars from monocular video in 60 seconds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16922–16932, 2023.

  29. [29] NeuMan: Neural Human Radiance Field from a Single Video
      Wei Jiang, Kwang Moo Yi, Golnoosh Samei, Oncel Tuzel, and Anurag Ranjan. NeuMan: Neural human radiance field from a single video. In European Conference on Computer Vision, pages 402–418. Springer, 2022.

  30. [30] Robust Dual Gaussian Splatting for Immersive Human-Centric Volumetric Videos
      Yuheng Jiang, Zhehao Shen, Yu Hong, Chengcheng Guo, Yize Wu, Yingliang Zhang, Jingyi Yu, and Lan Xu. Robust dual Gaussian splatting for immersive human-centric volumetric videos. ACM Transactions on Graphics (TOG), 43(6):1–15, 2024.

  31. [31] 3D Gaussian Splatting for Real-Time Radiance Field Rendering
      Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4):1–14, 2023.

  32. [32] Sapiens: Foundation for Human Vision Models
      Rawal Khirodkar, Timur Bagautdinov, Julieta Martinez, Su Zhaoen, Austin James, Peter Selednik, Stuart Anderson, and Shunsuke Saito. Sapiens: Foundation for human vision models. In European Conference on Computer Vision, pages 206–228. Springer, 2024.

  33. [33] PersonaBooth: Personalized Text-to-Motion Generation
      Boeun Kim, Hea In Jeong, JungHoon Sung, Yihua Cheng, Jeongmin Lee, Ju Yong Chang, Sang-Il Choi, Younggeun Choi, Saim Shin, Jungho Kim, et al. PersonaBooth: Personalized text-to-motion generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22756–22765, 2025.

  34. [34] HunyuanVideo: A Systematic Framework for Large Video Generative Models
      Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.

  35. [35] URAvatar: Universal Relightable Gaussian Codec Avatars
      Junxuan Li, Chen Cao, Gabriel Schwartz, Rawal Khirodkar, Christian Richardt, Tomas Simon, Yaser Sheikh, and Shunsuke Saito. URAvatar: Universal relightable Gaussian codec avatars. In SIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024.

  36. [36] TAVA: Template-Free Animatable Volumetric Actors
      Ruilong Li, Julian Tanke, Minh Vo, Michael Zollhöfer, Jürgen Gall, Angjoo Kanazawa, and Christoph Lassner. TAVA: Template-free animatable volumetric actors. In European Conference on Computer Vision, pages 419–436. Springer, 2022.

  37. [37] Animatable Gaussians: Learning Pose-Dependent Gaussian Maps for High-Fidelity Human Avatar Modeling
      Zhe Li, Zerong Zheng, Lizhen Wang, and Yebin Liu. Animatable Gaussians: Learning pose-dependent Gaussian maps for high-fidelity human avatar modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19711–19722, 2024.

  38. [38] Learning Implicit Templates for Point-Based Clothed Human Modeling
      Siyou Lin, Hongwen Zhang, Zerong Zheng, Ruizhi Shao, and Yebin Liu. Learning implicit templates for point-based clothed human modeling. In European Conference on Computer Vision, pages 210–228. Springer, 2022.

  39. [39] LUCAS: Layered Universal Codec Avatars
      Di Liu, Teng Deng, Giljoo Nam, Yu Rong, Stanislav Pidhorskyi, Junxuan Li, Jason Saragih, Dimitris N Metaxas, and Chen Cao. LUCAS: Layered universal codec avatars. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21127–21137, 2025.

  40. [40] Decoupled Weight Decay Regularization
      Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

  41. [41] Pixel Codec Avatars
      Shugao Ma, Tomas Simon, Jason Saragih, Dawei Wang, Yuecheng Li, Fernando De La Torre, and Yaser Sheikh. Pixel codec avatars. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 64–73, 2021.

  42. [42] Codec Avatar Studio: Paired Human Captures for Complete, Driveable, and Generalizable Avatars
      Julieta Martinez, Emily Kim, Javier Romero, Timur Bagautdinov, Shunsuke Saito, Shoou-I Yu, Stuart Anderson, Michael Zollhöfer, Te-Li Wang, Shaojie Bai, et al. Codec avatar studio: Paired human captures for complete, driveable, and generalizable avatars. Advances in Neural Information Processing Systems, 37:83008–83023, 2024.

  43. [43] NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
      Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.

  44. [44] Expressive Whole-Body 3D Gaussian Avatar
      Gyeongsik Moon, Takaaki Shiratori, and Shunsuke Saito. Expressive whole-body 3D Gaussian avatar. In European Conference on Computer Vision, pages 19–35. Springer, 2024.

  45. [45] Training Language Models to Follow Instructions with Human Feedback
      Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.

  46. [46] Predicting Loose-Fitting Garment Deformations Using Bone-Driven Motion Networks
      Xiaoyu Pan, Jiaming Mai, Xinwei Jiang, Dongxue Tang, Jingxiang Li, Tianjia Shao, Kun Zhou, Xiaogang Jin, and Dinesh Manocha. Predicting loose-fitting garment deformations using bone-driven motion networks. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–10, 2022.

  47. [47] ATLAS: Decoupling Skeletal and Shape Parameters for Expressive Parametric Human Modeling
      Jinhyung Park, Javier Romero, Shunsuke Saito, Fabian Prada, Takaaki Shiratori, Yichen Xu, Federica Bogo, Shoou-I Yu, Kris Kitani, and Rawal Khirodkar. ATLAS: Decoupling skeletal and shape parameters for expressive parametric human modeling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6508–6518, 2025.

  48. [48] Animatable Neural Radiance Fields for Modeling Dynamic Human Bodies
      Sida Peng, Junting Dong, Qianqian Wang, Shangzhan Zhang, Qing Shuai, Xiaowei Zhou, and Hujun Bao. Animatable neural radiance fields for modeling dynamic human bodies. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14314–14323, 2021.

  49. [49] GaussianAvatars: Photorealistic Head Avatars with Rigged 3D Gaussians
      Shenhan Qian, Tobias Kirschstein, Liam Schoneveld, Davide Davoli, Simon Giebenhain, and Matthias Nießner. GaussianAvatars: Photorealistic head avatars with rigged 3D Gaussians. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20299–20309, 2024.

  50. [50] LHM: Large Animatable Human Reconstruction Model from a Single Image in Seconds
      Lingteng Qiu, Xiaodong Gu, Peihao Li, Qi Zuo, Weichao Shen, Junfei Zhang, Kejie Qiu, Weihao Yuan, Guanying Chen, Zilong Dong, and Liefeng Bo. LHM: Large animatable human reconstruction model from a single image in seconds. CoRR, abs/2503.10625, 2025.

  51. [51] PF-LHM: 3D Animatable Avatar Reconstruction from Pose-Free Articulated Human Images
      Lingteng Qiu, Peihao Li, Qi Zuo, Xiaodong Gu, Yuan Dong, Weihao Yuan, Siyu Zhu, Xiaoguang Han, Guanying Chen, and Zilong Dong. PF-LHM: 3D animatable avatar reconstruction from pose-free articulated human images. arXiv preprint arXiv:2506.13766, 2025.

  52. [52] Direct Preference Optimization: Your Language Model Is Secretly a Reward Model
      Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023.

  53. [53] Relightable Gaussian Codec Avatars
      Shunsuke Saito, Gabriel Schwartz, Tomas Simon, Junxuan Li, and Giljoo Nam. Relightable Gaussian codec avatars. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 130–141, 2024.

  54. [54] Exploring Multimodal Diffusion Transformers for Enhanced Prompt-Based Image Editing
      Joonghyuk Shin, Alchan Hwang, Yujin Kim, Daneul Kim, and Jaesik Park. Exploring multimodal diffusion transformers for enhanced prompt-based image editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19492–19502, 2025.

  55. [55] Persona: Personalized Whole-Body 3D Avatar with Pose-Driven Deformations from a Single Image
      Geonhee Sim and Gyeongsik Moon. Persona: Personalized whole-body 3D avatar with pose-driven deformations from a single image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12670–12680, 2025.

  56. [56] DINOv3
      Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025.

  57. [57] Embedded Deformation for Shape Manipulation
      Robert W Sumner, Johannes Schmid, and Mark Pauly. Embedded deformation for shape manipulation. In ACM SIGGRAPH 2007 Papers, pages 80–es, 2007.

  58. [58] Light Stage Super-Resolution: Continuous High-Frequency Relighting
      Tiancheng Sun, Zexiang Xu, Xiuming Zhang, Sean Fanello, Christoph Rhemann, Paul Debevec, Yun-Ta Tsai, Jonathan T Barron, and Ravi Ramamoorthi. Light stage super-resolution: Continuous high-frequency relighting. ACM Transactions on Graphics (TOG), 39(6):1–12, 2020.

  59. [59] Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers
      Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, and Donald Metzler. Scale efficiently: Insights from pre-training and fine-tuning transformers. arXiv preprint arXiv:2109.10686, 2021.

  60. [60] Gemini: A Family of Highly Capable Multimodal Models
      Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

  61. [61] LLaMA: Open and Efficient Foundation Language Models
      Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

  62. [62] Llama 2: Open Foundation and Fine-Tuned Chat Models
      Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

  63. [63] Wan: Open and Advanced Large-Scale Video Generative Models
      Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

  64. [64] VGGT: Visual Geometry Grounded Transformer
      Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025.

  65. [65] A Survey on 3D Human Avatar Modeling – From Reconstruction to Generation
      Ruihe Wang, Yukang Cao, Kai Han, and Kwan-Yee K Wong. A survey on 3D human avatar modeling – from reconstruction to generation. arXiv preprint arXiv:2406.04253, 2024.

  66. [66] Relightable Full-Body Gaussian Codec Avatars
      Shaofei Wang, Tomas Simon, Igor Santesteban, Timur Bagautdinov, Junxuan Li, Vasu Agrawal, Fabian Prada, Shoou-I Yu, Pace Nalbone, Matt Gramlich, et al. Relightable full-body Gaussian codec avatars. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Papers, pages 1–12, 2025.

  67. [67] Template-Free Single-View 3D Human Digitalization with Diffusion-Guided LRM
      Zhenzhen Weng, Jingyuan Liu, Hao Tan, Zhan Xu, Yang Zhou, Serena Yeung-Levy, and Jimei Yang. Template-free single-view 3D human digitalization with diffusion-guided LRM. arXiv preprint arXiv:2401.12175, 2024.

  68. [68] Understanding and Improving Layer Normalization
      Jingjing Xu, Xu Sun, Zhiyuan Zhang, Guangxiang Zhao, and Junyang Lin. Understanding and improving layer normalization. Advances in Neural Information Processing Systems, 32, 2019.

  69. [69] VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time
      Sicheng Xu, Guojun Chen, Yu-Xiao Guo, Jiaolong Yang, Chong Li, Zhenyu Zang, Yizhong Zhang, Xin Tong, and Baining Guo. VASA-1: Lifelike audio-driven talking faces generated in real time. Advances in Neural Information Processing Systems, 37:660–684, 2024.

  70. [70] Gaussian Head Avatar: Ultra High-Fidelity Head Avatar via Dynamic Gaussians
      Yuelang Xu, Benwang Chen, Zhe Li, Hongwen Zhang, Lizhen Wang, Zerong Zheng, and Yebin Liu. Gaussian head avatar: Ultra high-fidelity head avatar via dynamic Gaussians. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1931–1941, 2024.

  71. [71] Towards Practical Capture of High-Fidelity Relightable Avatars
      Haotian Yang, Mingwu Zheng, Wanquan Feng, Haibin Huang, Yu-Kun Lai, Pengfei Wan, Zhongyuan Wang, and Chongyang Ma. Towards practical capture of high-fidelity relightable avatars. In SIGGRAPH Asia 2023 Conference Papers, pages 1–11, 2023.

  72. [72] Avatars for Teleconsultation: Effects of Avatar Embodiment Techniques on User Perception in 3D Asymmetric Telepresence
      Kevin Yu, Gleb Gorbachev, Ulrich Eck, Frieder Pankratz, Nassir Navab, and Daniel Roth. Avatars for teleconsultation: Effects of avatar embodiment techniques on user perception in 3D asymmetric telepresence. IEEE Transactions on Visualization and Computer Graphics, 27(11):4129–4139, 2021.

  73. [73] GUAVA: Generalizable Upper Body 3D Gaussian Avatar
      Dongbin Zhang, Yunfei Liu, Lijian Lin, Ye Zhu, Yang Li, Minghan Qin, Yu Li, and Haoqian Wang. GUAVA: Generalizable upper body 3D Gaussian avatar. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14205–14217, 2025.

  74. [74] The Unreasonable Effectiveness of Deep Features as a Perceptual Metric
      Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.

  75. [75] AvatarReX: Real-Time Expressive Full-Body Avatars
      Zerong Zheng, Xiaochen Zhao, Hongwen Zhang, Boning Liu, and Yebin Liu. AvatarReX: Real-time expressive full-body avatars. ACM Transactions on Graphics (TOG), 42(4):1–19, 2023.

  76. [76] IDOL: Instant Photorealistic 3D Human Creation from a Single Image
      Yiyu Zhuang, Jiaxi Lv, Hao Wen, Qing Shuai, Ailing Zeng, Hao Zhu, Shifeng Chen, Yujiu Yang, Xun Cao, and Wei Liu. IDOL: Instant photorealistic 3D human creation from a single image. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26308–26319, 2025.

  77. [77] Drivable 3D Gaussian Avatars
      Wojciech Zielonka, Timur Bagautdinov, Shunsuke Saito, Michael Zollhöfer, Justus Thies, and Javier Romero. Drivable 3D Gaussian avatars. In 2025 International Conference on 3D Vision (3DV), pages 979–990. IEEE, 2025.