arxiv: 2606.31981 · v1 · pith:EN4CQ2FRnew · submitted 2026-06-30 · 💻 cs.CV · cs.AI

LUNA: Learning Universal 3D Human Animation Beyond Skinning

Pith reviewed 2026-07-01 05:18 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords 3D human animation · neural avatar · Gaussian deformation · 2D driving · transformer regressor · hybrid supervision · LBS-free model · cross-identity generalization

The pith

LUNA maps 2D controls directly to 3D Gaussian human deformations without linear blend skinning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LUNA as an end-to-end neural animation model that accepts several kinds of 2D input (images, keypoints, sketches, and unseen characters) and outputs 3D Gaussian deformations for human motion. A transformer-based regressor separates global rigid motion from local non-rigid dynamics, while hybrid supervision distills structural knowledge from an LBS teacher and trains on both fitted data and large unlabeled videos. This setup targets the limitations of parametric models, such as fitting artifacts and restricted expressivity. A sympathetic reader would care whether the approach truly enables scalable, generalizable 3D avatars driven by everyday 2D signals.
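For readers unfamiliar with what LUNA sets out to replace, the following is a minimal NumPy sketch of standard linear blend skinning, in which each posed point is a weighted sum of per-bone rigid transforms of its rest position. It illustrates the generic LBS formula only; the function name `linear_blend_skinning`, the array layout, and the toy two-bone example are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def linear_blend_skinning(points, weights, rotations, translations):
    """Generic LBS: x' = sum_k w_k * (R_k @ x + t_k).

    points:       (N, 3) rest-pose positions
    weights:      (N, K) skinning weights, each row summing to 1
    rotations:    (K, 3, 3) per-bone rotation matrices
    translations: (K, 3) per-bone translations
    """
    # Transform every point by every bone: result has shape (N, K, 3).
    per_bone = np.einsum("kij,nj->nki", rotations, points) + translations[None]
    # Blend the K candidate positions of each point with its weights.
    return np.einsum("nk,nki->ni", weights, per_bone)

# Toy example (hypothetical values): two bones, two points.
points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
weights = np.array([[0.5, 0.5], [1.0, 0.0]])
rotations = np.stack([np.eye(3), np.eye(3)])
translations = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
skinned = linear_blend_skinning(points, weights, rotations, translations)
```

Because every posed point is constrained to this weighted blend of bone transforms, anything the skeleton and weights cannot express (loose cloth, hair) has to be handled separately, which is the restriction the paper says it wants to remove.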

Core claim

LUNA is the first end-to-end 3D animatable model supporting implicit 2D driving by directly mapping multiple 2D controls to 3D Gaussian deformations via a transformer motion regressor that disentangles global rigid motion from fine-grained local dynamics, trained with hybrid supervision that distills soft priors from an LBS teacher on limited fitted data plus large in-the-wild videos.

What carries the argument

Transformer-based motion regressor that disentangles global rigid motion from local dynamics, paired with hybrid LBS-teacher distillation and a loss supporting both fitted and unlabeled video data.
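One way to picture the split between global rigid motion and local dynamics is the toy NumPy sketch below: a single rotation and translation moves all canonical Gaussian centers together, and a per-Gaussian residual offset adds the fine-grained part. This is an illustration under stated assumptions, not the paper's regressor. The names `deform_gaussians` and `rotation_z`, the choice to add residuals in posed space, and the restriction to Gaussian centers (ignoring covariance, opacity, and color) are all hypothetical.

```python
import numpy as np

def rotation_z(angle_rad):
    """Rotation matrix about the z axis (helper for the toy example)."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def deform_gaussians(canonical_xyz, global_R, global_t, local_offsets):
    """Hypothetical decomposition: shared rigid motion plus per-Gaussian residuals.

    canonical_xyz: (N, 3) canonical Gaussian centers
    global_R:      (3, 3) rotation applied to all centers
    global_t:      (3,)   translation applied to all centers
    local_offsets: (N, 3) non-rigid residuals, added after the rigid step
    """
    rigid = canonical_xyz @ global_R.T + global_t
    return rigid + local_offsets

# Toy example: rotate 90 degrees about z, lift by 1, nudge one Gaussian.
canonical = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
offsets = np.zeros_like(canonical)
offsets[1] = [0.0, 0.0, 0.1]
posed = deform_gaussians(canonical, rotation_z(np.pi / 2),
                         np.array([0.0, 0.0, 1.0]), offsets)
```

In a learned system both the rigid transform and the offsets would be network outputs conditioned on the 2D driving signal; here they are hand-set so the arithmetic is easy to follow.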

If this is right

  • LUNA reaches visual fidelity competitive with LBS-based methods.
  • It produces realistic human motion with zero-shot cross-identity generalization.
  • It handles diverse 2D driving modalities including sketches and unseen characters.
  • Training scales to large unlabeled video collections beyond limited fitted datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The LBS-free design may extend naturally to clothing and hair dynamics that parametric models handle poorly.
  • Implicit 2D driving could support real-time avatar control from casual smartphone video.
  • The hybrid supervision pattern may apply to other 3D reconstruction tasks that mix synthetic and in-the-wild data.

Load-bearing premise

Hybrid supervision from an LBS teacher plus a loss usable on both fitted data and unlabeled videos suffices to resolve 2D-to-3D lifting ambiguity and scale beyond fitted datasets.
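The premise can be made concrete with a small sketch of a loss that works with or without teacher targets. The abstract does not give the actual loss terms, so everything here is an assumption: the function name `hybrid_loss`, the mean-squared photometric term, the squared-distance distillation term, and the `distill_weight` parameter are hypothetical stand-ins chosen only to show the pattern of mixing fitted and unlabeled samples.

```python
import numpy as np

def hybrid_loss(rendered, target, student_xyz, teacher_xyz=None, distill_weight=0.1):
    """Hypothetical hybrid objective.

    rendered, target: image arrays of identical shape
    student_xyz:      (N, 3) Gaussian centers predicted by the student
    teacher_xyz:      (N, 3) centers from an LBS teacher, or None for
                      unlabeled in-the-wild samples
    """
    # Image-space term: available for every sample, labeled or not.
    photometric = np.mean((rendered - target) ** 2)
    if teacher_xyz is None:
        return photometric
    # Soft structural prior: pull student centers toward the teacher's.
    distill = np.mean(np.sum((student_xyz - teacher_xyz) ** 2, axis=-1))
    return photometric + distill_weight * distill

# Toy values, not real data.
rendered = np.zeros((2, 2, 3))
target = np.ones((2, 2, 3))
student = np.zeros((4, 3))
teacher = np.ones((4, 3))
unlabeled_loss = hybrid_loss(rendered, target, student)
fitted_loss = hybrid_loss(rendered, target, student, teacher_xyz=teacher)
```

Whether an image-space term alone constrains depth well enough on unlabeled video is exactly the ambiguity the premise depends on; the sketch only shows how the two data sources could share one objective.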

What would settle it

Failure to produce coherent deformations on a set of in-the-wild 2D driving videos where traditional LBS fitting is known to be ambiguous would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.31981 by Chen Cao, Junxuan Li, Peng Li, Rawal Khirodkar, Shunsuke Saito, Wenhan Luo, Yike Guo, Yuan Dong, Yuan Liu.

Figure 1
Figure 1: Given a handful of human images, LUNA reconstructs a high-fidelity animatable 3D avatar, supporting versatile 2D control signals - including RGB images, 2D keypoints, hand-drawn sketches, and other unseen characters - without any additional preprocessing. Project page: https://penghtyx.github.io/LUNA/
Figure 2
Figure 2: Overview. Given N unposed multi-view identity images and a 2D driving signal, LUNA first reconstructs canonical 3D Gaussians with an Identity Encoder. A transformer-based Implicit Neural Animator then maps them to posed space conditioned on the driving signal. During training, the driving image is randomly sampled across modalities (RGB, keypoints or sketches).
Figure 3
Figure 3: Our results with diverse 2D driving signals.
Figure 4
Figure 4: Qualitative comparisons on Cloth10K. LBS-dependent approaches (IDOL and LHM) suffer from severe ID shift or structural tearing. In contrast, LUNA preserves the structural integrity and continuous topology of the fabric.
Baselines and Metrics. We compare against monocular optimization methods (Vid2Avatar [14], ExAvatar [44]) and recent feed-forward models (IDOL [84], LHM [52], UP2YOU [3]). We also include ou…
Figure 5
Figure 5: Qualitative comparisons of animation smoothness.
Figure 6
Figure 6: Ablations of distillation regularization and multiview finetuning.
Figure 8
Figure 8: Challenging pose-driven cases. LUNA handles challenging poses and loose-cloth mismatch in many cases, while more extreme pose and body-shape mismatch remain difficult.
Extensive experiments demonstrate LUNA’s robust zero-shot generalization across heterogeneous identities, loose clothing, and cross-modality driving sources, establishing a highly flexible paradigm for 3D digit…
Figure 9
Figure 9: Qualitative comparison of pose accuracy.
Figure 10
Figure 10: Our animation results on NeuMan dataset.
read the original abstract

Creating photorealistic, animatable 3D human avatars from monocular images still largely depends on Linear Blend Skinning (LBS) and parametric body models, which constrain expressivity and often introduce artifacts due to imperfect fitting. We propose LUNA, an LBS-free universal neural animation model that directly maps multiple 2D controls like images, keypoints, sketches, and unseen characters into 3D Gaussian deformations, bypassing explicit body fitting. At its core, a transformer-based motion regressor disentangles global rigid motion from fine-grained local dynamics to capture both coherent movement and subtle non-rigid effects. To resolve the inherent ambiguity of 2D-to-3D lifting while scaling beyond fitted datasets, we introduce hybrid supervision that distills soft structural priors from an LBS teacher and a loss that supports training on both limited fitted data and large in-the-wild unlabeled videos. Extensive experiments show LUNA achieves competitive visual fidelity compared to LBS-based approaches, while delivering realistic human motion and zero-shot cross-identity generalization across diverse driving modalities. To the best of our knowledge, LUNA is the first end-to-end 3D animatable model that supports implicit 2D driving.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes LUNA, an LBS-free universal neural animation model that directly maps multiple 2D controls (images, keypoints, sketches, unseen characters) to 3D Gaussian deformations via a transformer-based motion regressor that disentangles global rigid motion from local dynamics. Hybrid supervision distills soft structural priors from an LBS teacher while supporting training on limited fitted data and large in-the-wild unlabeled videos to resolve 2D-to-3D ambiguity. The manuscript reports competitive visual fidelity to LBS-based methods, realistic motion, zero-shot cross-identity generalization across driving modalities, and claims to be the first end-to-end 3D animatable model supporting implicit 2D driving.

Significance. If the experimental claims hold with rigorous quantitative validation, this work could meaningfully advance 3D human animation by removing reliance on parametric LBS models, enabling greater expressivity for non-rigid effects and better scaling to in-the-wild data. The hybrid supervision strategy for handling lifting ambiguity represents a practical contribution worth evaluating against existing distillation and self-supervised approaches in the field.

major comments (2)
  1. Abstract: The claims of 'competitive visual fidelity', 'realistic human motion', and 'zero-shot cross-identity generalization' are asserted without any quantitative metrics, error analysis, ablation studies, or experimental details supplied in the text, preventing assessment of whether the hybrid supervision actually resolves the 2D-to-3D ambiguity or enables the reported scaling.
  2. Abstract: The assertion that LUNA is 'the first end-to-end 3D animatable model that supports implicit 2D driving' is a strong novelty claim; without the related-work section or explicit comparisons to prior implicit or end-to-end methods, it is impossible to verify whether this holds or if the hybrid supervision is the key differentiator.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their review and the opportunity to address these points regarding the abstract. We respond to each major comment below, clarifying the support provided in the full manuscript.

read point-by-point responses
  1. Referee: Abstract: The claims of 'competitive visual fidelity', 'realistic human motion', and 'zero-shot cross-identity generalization' are asserted without any quantitative metrics, error analysis, ablation studies, or experimental details supplied in the text, preventing assessment of whether the hybrid supervision actually resolves the 2D-to-3D ambiguity or enables the reported scaling.

    Authors: The abstract is a high-level summary. The full manuscript contains a dedicated Experiments section (Section 4) with quantitative metrics for visual fidelity and motion quality, error analysis, ablation studies on the hybrid supervision, and evaluation of its role in resolving 2D-to-3D ambiguity and enabling scaling to in-the-wild data. These results are presented with tables, figures, and comparisons. We can revise the abstract to include brief references to Section 4 for improved clarity. revision: partial

  2. Referee: Abstract: The assertion that LUNA is 'the first end-to-end 3D animatable model that supports implicit 2D driving' is a strong novelty claim; without the related-work section or explicit comparisons to prior implicit or end-to-end methods, it is impossible to verify whether this holds or if the hybrid supervision is the key differentiator.

    Authors: The manuscript includes a Related Work section (Section 2) reviewing LBS-based, implicit, and end-to-end animation methods, along with explicit comparisons in the Experiments section. The novelty claim is qualified as 'to the best of our knowledge' and is supported by positioning LUNA's LBS-free implicit 2D driving capability, with the hybrid supervision detailed in the Method section as a key enabler. No revision is needed as the supporting sections are present. revision: no

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The abstract and supplied text contain no equations, derivations, predictions, or self-citations that could reduce any claimed result to its inputs by construction. The hybrid supervision (LBS-teacher distillation plus in-the-wild loss) is presented as a methodological choice to address ambiguity, with no fitted-parameter-as-prediction pattern, uniqueness theorem, or ansatz smuggling visible. The central claim of being the first end-to-end implicit-2D-driving model is a novelty assertion, not a derivation. No load-bearing steps exist to analyze, so the paper's description remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No technical details on free parameters, axioms, or invented entities are present in the abstract.

pith-pipeline@v0.9.1-grok · 5766 in / 1026 out tokens · 36721 ms · 2026-07-01T05:18:07.714622+00:00 · methodology


Reference graph

Works this paper leans on

85 extracted references · 17 canonical work pages · 3 internal anchors

  1. [1]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

    Alldieck, T., Zanfir, M., Sminchisescu, C.: Photorealistic monocular 3d reconstruction of humans wearing clothing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

  2. [2]

    ACM Transactions on Graphics (TOG) 40(4), 1–17 (2021)

    Bagautdinov, T., Wu, C., Simon, T., Prada, F., Shiratori, T., Wei, S.E., Xu, W., Sheikh, Y., Saragih, J.: Driving-signal aware full-body avatars. ACM Transactions on Graphics (TOG) 40(4), 1–17 (2021)

  3. [3]

    arXiv preprint arXiv:2509.24817 (2025)

    Cai, Z., Li, Z., Li, X., Li, B., Wang, Z., Zhang, Z., Xiu, Y.: Up2you: Fast reconstruction of yourself from unconstrained photo collections. arXiv preprint arXiv:2509.24817 (2025)

  4. [4]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Chan, C., Ginosar, S., Zhou, T., Efros, A.A.: Everybody dance now. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 5933–5942 (2019)

  5. [5]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Chen, J., Hu, J., Wang, G., Jiang, Z., Zhou, T., Chen, Z., Lv, C.: Taoavatar: Real-time lifelike full-body talking avatars for augmented reality via 3d gaussian splatting. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 10723–10734 (2025)

  6. [6]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Chen, J., Yi, W., Ma, L., Jia, X., Lu, H.: Gm-nerf: Learning generalizable model-based neural radiance fields from multi-view images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 20648–20658 (June 2023)

  7. [7]

    In: ECCV (2022)

    Chen, M., Zhang, J., Xu, X., Liu, L., Cai, Y., Feng, J., Yan, S.: Geometry-guided progressive nerf for generalizable and efficient neural human rendering. In: ECCV (2022)

  8. [8]

    arXiv preprint arXiv:2510.07723 (2025)

    Chen, W., Li, P., Zheng, W., Zhao, C., Li, M., Zhu, Y., Dou, Z., Wang, R., Liu, Y.: Synchuman: Synchronizing 2d and 3d generative models for single-view human reconstruction. arXiv preprint arXiv:2510.07723 (2025)

  9. [9]

    Chen, Y., Zheng, Z., Li, Z., Xu, C., Liu, Y.: Meshavatar: Learning high-quality triangular human avatars from multi-view videos (2024), https://arxiv.org/abs/2407.08414

  10. [10]

    arXiv preprint arXiv:2509.14055 (2025)

    Cheng, G., Gao, X., Hu, L., Hu, S., Huang, M., Ji, C., Li, J., Meng, D., Qi, J., Qiao, P., et al.: Wan-animate: Unified character animation and replacement with holistic replication. arXiv preprint arXiv:2509.14055 (2025)

  11. [11]

    In: CVPR (2023)

    Du, Y., Kips, R., Pumarola, A., Starke, S., Thabet, A., Sanakoyeu, A.: Avatars grow legs: Generating smooth human motion from sparse tracking inputs with diffusion model. In: CVPR (2023)

  12. [12]

    In: ICML (2024)

    Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: ICML (2024)

  13. [13]

    Ferguson, A., Osman, A.A.A., Bescos, B., Stoll, C., Twigg, C., Lassner, C., Otte, D., Vignola, E., Prada, F., Bogo, F., Santesteban, I., Romero, J., Zarate, J., Lee, J., Park, J., Yang, J., Doublestein, J., Venkateshan, K., Kitani, K., Kavan, L., Farra, M.D., Hu, M., Cioffi, M., Fabris, M., Ranieri, M., Modarres, M., Kadlecek, P., Khirodkar, R., Abdrashit...

  14. [14]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Guo, C., Jiang, T., Chen, X., Song, J., Hilliges, O.: Vid2avatar: 3d avatar reconstruction from videos in the wild via self-supervised scene decomposition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12858–12868 (2023)

  15. [15]

    In: European conference on computer vision (ECCV) (2024)

    Guo, C., Jiang, T., Kaufmann, M., Zheng, C., Valentin, J., Song, J., Hilliges, O.: Reloo: Reconstructing humans dressed in loose garments from monocular video in the wild. In: European conference on computer vision (ECCV) (2024)

  16. [16]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Guo, C., Li, J., Kant, Y., Sheikh, Y., Saito, S., Cao, C.: Vid2avatar-pro: Authentic avatar from videos in the wild via universal prior. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 5559–5570 (2025)

  17. [17]

    ACM Transactions on Graphics 40(4) (aug 2021)

    Habermann, M., Liu, L., Xu, W., Zollhoefer, M., Pons-Moll, G., Theobalt, C.: Real-time deep dynamic characters. ACM Transactions on Graphics 40(4) (aug 2021)

  18. [18]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    He, T., Xu, Y., Saito, S., Soatto, S., Tung, T.: Arch++: Animation-ready clothed human reconstruction revisited. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 11046–11056 (2021)

  19. [19]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Ho, I., Song, J., Hilliges, O., et al.: Sith: Single-view textured human reconstruction with image-conditioned diffusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 538–549 (2024)

  20. [20]

    LRM: Large Reconstruction Model for Single Image to 3D

    Hong, Y., Zhang, K., Gu, J., Bi, S., Zhou, Y., Liu, D., Liu, F., Sunkavalli, K., Bui, T., Tan, H.: Lrm: Large reconstruction model for single image to 3d. arXiv preprint arXiv:2311.04400 (2023)

  21. [21]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Hu, L.: Animate anyone: Consistent and controllable image-to-video synthesis for character animation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8153–8163 (2024)

  22. [22]

    In: 2022 International Conference on 3D Vision (3DV) (2022)

    Hu, T., Yu, T., Zheng, Z., Zhang, H., Liu, Y., Zwicker, M.: Hvtr: Hybrid volumetric-textural rendering for human avatars. In: 2022 International Conference on 3D Vision (3DV) (2022)

  23. [23]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Huang, Z., Xu, Y., Lassner, C., Li, H., Tung, T.: Arch: Animatable reconstruction of clothed humans. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3093–3102 (2020)

  24. [24]

    In: ACM SIGGRAPH (2021)

    Jiakai, Z., Xinhang, L., Xinyi, Y., Fuqiang, Z., Yanshun, Z., Minye, W., Yingliang, Z., Lan, X., Jingyi, Y.: Editable free-viewpoint video using a layered neural representation. In: ACM SIGGRAPH (2021)

  25. [25]

    In: Proceedings of the European conference on computer vision (ECCV) (2022)

    Jiang, W., Yi, K.M., Samei, G., Tuzel, O., Ranjan, A.: Neuman: Neural human radiance field from a single video. In: Proceedings of the European conference on computer vision (ECCV) (2022)

  26. [26]

    Jung, H., Brasch, N., Song, J., Perez-Pellitero, E., Zhou, Y., Li, Z., Navab, N., Busam, B.: Deformable 3d gaussian splatting for animatable human avatars (2023), https://arxiv.org/abs/2312.15059

  27. [27]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)

    Kanazawa, A., Zhang, J.Y., Felsen, P., Malik, J.: Learning 3d human dynamics from video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)

  28. [28]

    TOG (2023), https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/

    Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for real-time radiance field rendering. TOG (2023), https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/

  29. [29]

    In: European Conference on Computer Vision

    Khirodkar, R., Bagautdinov, T., Martinez, J., Zhaoen, S., James, A., Selednik, P., Anderson, S., Saito, S.: Sapiens: Foundation for human vision models. In: European Conference on Computer Vision. pp. 206–228. Springer (2024)

  30. [30]

    In: European Conference on Computer Vision (2024)

    Kwon, Y., Fang, B., Lu, Y., Dong, H., Zhang, C., Carrasco, F.V., Mosella-Montoro, A., Xu, J., Takagi, S., Kim, D., Prakash, A., la Torre, F.D.: Generalizable human gaussians for sparse view synthesis. In: European Conference on Computer Vision (2024)

  31. [31]

    Lei, J., Wang, Y., Pavlakos, G., Liu, L., Daniilidis, K.: Gart: Gaussian articulated template models (2023)

  32. [32]

    In: ACM SIGGRAPH 2024 Conference Papers (2024)

    Li, J., Cao, C., Schwartz, G., Khirodkar, R., Richardt, C., Simon, T., Sheikh, Y., Saito, S.: Uravatar: Universal relightable gaussian codec avatars. In: ACM SIGGRAPH 2024 Conference Papers (2024)

  33. [33]

    Li, M., Yao, S., Xie, Z., Chen, K.: Gaussianbody: Clothed human reconstruction via 3d gaussian splatting (2024), https://arxiv.org/abs/2401.09720

  34. [34]

    PSHuman: Photorealistic Single-image 3D Human Reconstruction using Cross-Scale Multiview Diffusion and Explicit Remeshing

    Li, P., Zheng, W., Liu, Y., Yu, T., Li, Y., Qi, X., Li, M., Chi, X., Xia, S., Xue, W., et al.: Pshuman: Photorealistic single-view human reconstruction using cross-scale diffusion. arXiv preprint arXiv:2409.10141 (2024)

  35. [35]

    In: European Conference on Computer Vision (ECCV) (2022)

    Li, R., Tanke, J., Vo, M., Zollhofer, M., Gall, J., Kanazawa, A., Lassner, C.: Tava: Template-free animatable volumetric actors. In: European Conference on Computer Vision (ECCV) (2022)

  36. [36]

    ACM SIGGRAPH Conference Proceedings (2023)

    Li, Z., Zheng, Z., Liu, Y., Zhou, B., Liu, Y.: Posevocab: Learning joint-structured pose embeddings for human avatar modeling. ACM SIGGRAPH Conference Proceedings (2023)

  37. [37]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Li, Z., Zheng, Z., Wang, L., Liu, Y.: Animatable gaussians: Learning pose-dependent gaussian maps for high-fidelity human avatar modeling. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19711–19722 (2024)

  38. [38]

    In: CVPR (2024)

    Li, Z., Zheng, Z., Wang, L., Liu, Y.: Animatable gaussians: Learning pose-dependent gaussian maps for high-fidelity human avatar modeling. In: CVPR (2024)

  39. [39]

    ACM Trans. Graph. (ACM SIGGRAPH Asia) (2021)

    Liu, L., Habermann, M., Rudnev, V., Sarkar, K., Gu, J., Theobalt, C.: Neural actor: Neural free-view synthesis of human actors with pose control. ACM Trans. Graph. (ACM SIGGRAPH Asia) (2021)

  40. [40]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Liu, W., Piao, Z., Min, J., Luo, W., Ma, L., Gao, S.: Liquid warping gan: A unified framework for human motion imitation, appearance transfer and novel view synthesis. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 5904–5913 (2019)

  41. [41]

    TOG 34(6), 1–16 (2015)

    Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: Smpl: a skinned multi-person linear model. TOG 34(6), 1–16 (2015)

  42. [42]

    NeurIPS Track on Datasets and Benchmarks (2024)

    Martinez, J., Kim, E., Romero, J., Bagautdinov, T., Saito, S., Yu, S.I., Anderson, S., Zollhöfer, M., Wang, T.L., Bai, S., Li, C., Wei, S.E., Joshi, R., Borsos, W., Simon, T., Saragih, J., Theodosis, P., Greene, A., Josyula, A., Maeta, S.M., Jewett, A.I., Venshtain, S., Heilman, C., Chen, Y.T., Fu, S., Elshaer, M.E.A., Du, T., Wu, L., Chen, S.C., Kang, K....

  43. [43]

    In: ECCV

    Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: NeRF: Representing scenes as neural radiance fields for view synthesis. In: ECCV. pp. 405–421 (2020)

  44. [44]

    In: ECCV (2024)

    Moon, G., Shiratori, T., Saito, S.: Expressive whole-body 3d gaussian avatar. In: ECCV (2024)

  45. [45]

    In: CVPR (2024)

    Moreau, A., Song, J., Dhamo, H., Shaw, R., Zhou, Y., Pérez-Pellitero, E.: Human gaussian splatting: Real-time rendering of animatable avatars. In: CVPR (2024)

  46. [46]

    In: International Conference on Computer Vision (2021)

    Noguchi, A., Sun, X., Lin, S., Harada, T.: Neural articulated radiance field. In: International Conference on Computer Vision (2021)

  47. [47]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Pang, H., Zhu, H., Kortylewski, A., Theobalt, C., Habermann, M.: Ash: Animatable gaussian splats for efficient and photoreal human rendering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1165–1175 (June 2024)

  48. [48]

    In: Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)

    Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A.A., Tzionas, D., Black, M.J.: Expressive body capture: 3D hands, face, and body from a single image. In: Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). pp. 10975–10985 (2019)

  49. [49]

    In: CVPR (2019)

    Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A., Tzionas, D., Black, M.J.: Expressive body capture: 3d hands, face, and body from a single image. In: CVPR (2019)

  50. [50]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Peng, S., Dong, J., Wang, Q., Zhang, S., Shuai, Q., Zhou, X., Bao, H.: Animatable neural radiance fields for modeling dynamic human bodies. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14314–14323 (2021)

  51. [51]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Peng, S., Zhang, Y., Xu, Y., Wang, Q., Shuai, Q., Bao, H., Zhou, X.: Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9054–9063 (2021)

  52. [52]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Qiu, L., Gu, X., Li, P., Zuo, Q., Shen, W., Zhang, J., Qiu, K., Yuan, W., Chen, G., Dong, Z., et al.: Lhm: Large animatable human reconstruction model for single image to 3d in seconds. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14184–14194 (2025)

  53. [53]

    arXiv preprint arXiv:2506.13766 (2025)

    Qiu, L., Li, P., Zuo, Q., Gu, X., Dong, Y., Yuan, W., Zhu, S., Han, X., Chen, G., Dong, Z.: Pf-lhm: 3d animatable avatar reconstruction from pose-free articulated human images. arXiv preprint arXiv:2506.13766 (2025)

  54. [54]

    In: CVPR (2025)

    Qiu, L., Zhu, S., Zuo, Q., Gu, X., Dong, Y., Zhang, J., Xu, C., Li, Z., Yuan, W., Bo, L., et al.: Anigs: Animatable gaussian avatar from a single image with inconsistent gaussian reconstruction. In: CVPR (2025)

  55. [55]

    In: ACM SIGGRAPH 2022 Conference Proceedings

    Remelli, E., Bagautdinov, T., Saito, S., Wu, C., Simon, T., Wei, S.E., Guo, K., Cao, Z., Prada, F., Saragih, J., et al.: Drivable volumetric avatars using texel-aligned features. In: ACM SIGGRAPH 2022 Conference Proceedings. pp. 1–9 (2022)

  56. [56]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Saito, S., Huang, Z., Natsume, R., Morishima, S., Kanazawa, A., Li, H.: Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 2304–2314 (2019)

  57. [57]

    In: CVPR (2024)

    Saito, S., Schwartz, G., Simon, T., Li, J., Nam, G.: Relightable gaussian codec avatars. In: CVPR (2024)

  58. [58]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Saito, S., Simon, T., Saragih, J., Joo, H.: Pifuhd: Multi-level pixel-aligned implicit function for high-resolution 3d human digitization. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 84–93 (2020)

  59. [59]

    TOG 43(6) (2024)

    Shao, R., Pang, Y., Zheng, Z., Sun, J., Liu, Y.: Human4dit: 360-degree human video generation with 4d diffusion transformer. TOG 43(6) (2024)

  60. [60]

    In: Computer Vision and Pattern Recognition (CVPR) (2023)

    Shen, K., Guo, C., Kaufmann, M., Zarate, J., Valentin, J., Song, J., Hilliges, O.: X-avatar: Expressive human avatars. In: Computer Vision and Pattern Recognition (CVPR) (2023)

  61. [61]

    DINOv3

    Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., Massa, F., Haziza, D., Wehrstedt, L., Wang, J., Darcet, T., Moutakanni, T., Sentana, L., Roberts, C., Vedaldi, A., Tolan, J., Brandt, J., Couprie, C., Mairal, J., Jégou, H., Labatut, P., Bojanowski, P.: DINOv3 (...

  62. [62]

    In: The Thirteenth International Conference on Learning Representations (2025)

    Song, C., Wu, Z., Su, S.Y., Wandt, B., Sigal, L., Rhodin, H.: Locality sensitive avatars from video. In: The Thirteenth International Conference on Learning Representations (2025)

  63. [63]

    In: 3DV (2025)

    Tan, J., Xiang, D., Tulsiani, S., Ramanan, D., Yang, G.: Dressrecon: Freeform 4d human reconstruction from monocular video. In: 3DV (2025)

  64. [64]

    In: European Conference on Computer Vision (ECCV) (2022)

    Wang, S., Schwarz, K., Geiger, A., Tang, S.: Arah: Animatable volume rendering of articulated human sdfs. In: European Conference on Computer Vision (ECCV) (2022)

  65. [65]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Wang, T., Li, L., Lin, K., Zhai, Y., Lin, C.C., Yang, Z., Zhang, H., Liu, Z., Wang, L.: Disco: Disentangled control for realistic human dance generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9326–9336 (2024)

  66. [66]

    In: CVPR (2022)

    Weng, C.Y., Curless, B., Srinivasan, P.P., Barron, J.T., Kemelmacher-Shlizerman, I.: Humannerf: Free-viewpoint rendering of moving people from monocular video. In: CVPR (2022)

  67. [67]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Xiu, Y., Yang, J., Cao, X., Tzionas, D., Black, M.J.: Econ: Explicit clothed humans optimized via normal integration. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 512–523 (2023)

  68. [68]

    In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Xiu, Y., Yang, J., Tzionas, D., Black, M.J.: Icon: Implicit clothed humans obtained from normals. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 13286–13296. IEEE (2022)

  69. [69]

    In: CVPR (2022)

    Xu, T., Fujita, Y., Matsumoto, E.: Surface-aligned neural radiance fields for controllable 3d human synthesis. In: CVPR (2022)

  70. [70]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Xu, Z., Zhang, J., Liew, J.H., Yan, H., Liu, J.W., Zhang, C., Feng, J., Shou, M.Z.: Magicanimate: Temporally consistent human image animation using diffusion model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1481–1490 (2024)

  71. [71]

    arXiv preprint arXiv:2602.15989 (2026)

    Yang, X., Kukreja, D., Pinkus, D., Sagar, A., Fan, T., Park, J., Shin, S., Cao, J., Liu, J., Ugrinovic, N., Feiszli, M., Malik, J., Dollar, P., Kitani, K.: Sam 3d body: Robust full-body human mesh recovery. arXiv preprint arXiv:2602.15989 (2026)

  72. [72]

    In: Advances in Neural Information Processing Systems (2021)

    Yariv, L., Gu, J., Kasten, Y., Lipman, Y.: Volume rendering of neural implicit surfaces. In: Advances in Neural Information Processing Systems (2021)

  73. [73]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (June 2021)

    Yu, T., Zheng, Z., Guo, K., Liu, P., Dai, Q., Liu, Y.: Function4d: Real-time human volumetric capture from very sparse consumer rgbd sensors. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (June 2021)

  74. [74]

    In: Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers

    Yu, Z., Li, Z., Bao, H., Yang, C., Zhou, X.: Humanram: Feed-forward human reconstruction and animation model using transformers. In: Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers. pp. 1–13 (2025)

  75. [75]

    arXiv preprint arXiv:2406.19680 (2024)

    Zhang, Y., Gu, J., Wang, L.W., Wang, H., Cheng, J., Zhu, Y., Zou, F.: Mimicmotion: High-quality human motion video generation with confidence-aware pose guidance. arXiv preprint arXiv:2406.19680 (2024)

  76. [76]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Zhao, F., Yang, W., Zhang, J., Lin, P., Zhang, Y., Yu, J., Xu, L.: Humannerf: Efficiently generated human radiance field from sparse inputs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7743–7753 (June 2022)

  77. [77]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Zhao, J., Zhang, H.: Thin-plate spline motion model for image animation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3657–3666 (2022)

  78. [78]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Zheng, R., Li, P., Wang, H., Yu, T.: Learning visibility field for detailed 3d human reconstruction and relighting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 216–226 (2023)

  79. [79]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

    Zheng, S., Zhou, B., Shao, R., Liu, B., Zhang, S., Nie, L., Liu, Y.: Gps-gaussian: Generalizable pixel-wise 3d gaussian splatting for real-time human novel view synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

  80. [80]

    IEEE transactions on pattern analysis and machine intelligence 44(6), 3170–3184 (2021)

    Zheng, Z., Yu, T., Liu, Y., Dai, Q.: Pamir: Parametric model-conditioned implicit representation for image-based human reconstruction. IEEE transactions on pattern analysis and machine intelligence 44(6), 3170–3184 (2021)

Showing first 80 references.