EMOSH: Expressive Motion and Shape Disentanglement for Human Animation

Binquan Dai; Chen Li; Chuming Wang; Dongbin Zhang; Hao Liu; Haoqian Wang; Jing Lyu; Kangjie Chen

arxiv: 2606.28026 · v1 · pith:33OCJYWDnew · submitted 2026-06-26 · 💻 cs.CV

Dongbin Zhang, Hao Liu, Binquan Dai, Kangjie Chen, Chuming Wang, Chen Li, Jing Lyu, Haoqian Wang

Pith reviewed 2026-06-29 04:39 UTC · model grok-4.3

classification 💻 cs.CV

keywords: human animation, shape disentanglement, motion disentanglement, video generation, expressive human model, pose estimation, cross-driven animation

The pith

Disentangling shape and pose parameters in an expressive human model stops body shape leakage during video animation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current 2D pose methods let the driving person's body shape leak into the generated video, while 3D-prior methods keep shape separate but produce stiff results that miss expressions. EMOSH builds an Expressive Human Model that separates shape parameters from pose parameters as the central control signal. A dedicated motion tracker pulls these parameters from any input video. Coarse-to-fine motion injection and spatially aligned conditioning then transfer expressions and gestures while preserving the target identity. The framework therefore supports both self-driven and cross-driven animation with higher fidelity and explicit shape control.

Core claim

The paper introduces the Expressive Human Model (EHM) whose explicit separation of shape and pose parameters removes motion-shape entanglement. A robust motion tracker extracts EHM parameters from video; Coarse-to-Fine Hybrid Motion Injection supplies fine-grained expression and gesture control; and Spatially-Aligned Conditioning closes the train-inference gap. These elements together generate high-fidelity videos that keep target body shape intact while transferring vivid expressions in both self-driven and cross-driven settings.

What carries the argument

The Expressive Human Model (EHM) serves as the core control representation: it explicitly separates shape parameters from pose parameters.
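
To make the separation concrete, here is a minimal Python sketch of cross-driven retargeting under stated assumptions. The `EHMParams` container, its field names, and the `retarget` function are hypothetical illustrations, not the paper's data structures or API; the only idea taken from the source is that motion comes from the driving subject while shape comes from the reference subject.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class EHMParams:
    """Hypothetical parameter container; field names are illustrative only."""
    body_shape: Tuple[float, ...]   # identity-specific body proportions
    face_shape: Tuple[float, ...]   # identity-specific face geometry
    pose: Tuple[float, ...]         # body and hand joint rotations
    expression: Tuple[float, ...]   # facial expression coefficients


def retarget(driving: EHMParams, reference: EHMParams) -> EHMParams:
    """Keep the driving subject's motion, swap in the reference subject's shape."""
    return replace(
        driving,
        body_shape=reference.body_shape,
        face_shape=reference.face_shape,
    )
```

Because the shape fields are overwritten wholesale, body proportions estimated from the driving video never reach the generator in this sketch. What remains open is whether the tracker keeps shape information out of the pose and expression estimates in the first place.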

If this is right

Self-driven animation produces higher-fidelity output with preserved identity.
Cross-driven animation transfers expressions and gestures without transferring body shape.
Complex gestures and facial expressions remain controllable without the rigidity of pure 3D priors.
Identity consistency improves across training and inference through spatial alignment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Avatar pipelines could reduce manual corrections for shape mismatches during cross-identity transfers.
Real-time applications would need separate validation of the tracker's accuracy on live streams.
The same separation principle could be tested on full-body plus hand animation where current methods still leak proportions.

Load-bearing premise

A motion tracker can reliably extract accurate EHM parameters from any video input without errors that would reintroduce shape leakage or weaken expression control.
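
The tracker figure on this page mentions filtering unreliable guidance signals via Validity Gating but does not state the gating rule. As a rough illustration only, the Python sketch below assumes a fixed confidence threshold on 2D keypoint detections; the function name, the threshold value, and the squared-error form are all assumptions rather than the paper's formulation.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def gated_keypoint_loss(
    predicted: Sequence[Point],
    detected: Sequence[Point],
    confidence: Sequence[float],
    threshold: float = 0.5,  # assumed gate; the source gives no value
) -> float:
    """Mean squared 2D error over keypoints whose confidence passes the gate."""
    total = 0.0
    kept = 0
    for (px, py), (dx, dy), c in zip(predicted, detected, confidence):
        if c < threshold:
            continue  # unreliable detection: contributes nothing to the fit
        total += (px - dx) ** 2 + (py - dy) ** 2
        kept += 1
    # With no reliable keypoints there is nothing to fit against.
    return total / kept if kept else 0.0
```

In this sketch a low-confidence detection, such as an occluded hand, cannot pull the fitted parameters toward a wrong position, which is the kind of error the premise above is concerned with.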

What would settle it

A cross-identity driving test in which the generated video visibly adopts the driving subject's body proportions instead of the target's proportions would show the disentanglement has failed.
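
One hypothetical way to turn this test into a number is to compare a scale-invariant body proportion across the three subjects. The Python sketch below uses the ratio of shoulder width to torso length from 2D keypoints; the keypoint names, the choice of ratio, and the `leakage_score` definition are assumptions for illustration, not a metric from the paper.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float]


def _dist(a: Point, b: Point) -> float:
    return math.hypot(a[0] - b[0], a[1] - b[1])


def _midpoint(a: Point, b: Point) -> Point:
    return ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)


def shoulder_torso_ratio(kp: Dict[str, Point]) -> float:
    """Shoulder width divided by torso length (independent of image scale)."""
    shoulder = _dist(kp["l_shoulder"], kp["r_shoulder"])
    torso = _dist(
        _midpoint(kp["l_shoulder"], kp["r_shoulder"]),
        _midpoint(kp["l_hip"], kp["r_hip"]),
    )
    return shoulder / torso


def leakage_score(generated, reference, driving) -> float:
    """0.0 means reference proportions, 1.0 means driving proportions."""
    g = shoulder_torso_ratio(generated)
    r = shoulder_torso_ratio(reference)
    d = shoulder_torso_ratio(driving)
    if math.isclose(r, d):
        raise ValueError("reference and driving proportions are indistinguishable")
    return (g - r) / (d - r)
```

A 2D ratio also changes with pose and camera angle, so a check like this is only meaningful on roughly frontal, upright frames.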

Figures

Figures reproduced from arXiv: 2606.28026 by Binquan Dai, Chen Li, Chuming Wang, Dongbin Zhang, Hao Liu, Haoqian Wang, Jing Lyu, Kangjie Chen.

**Figure 1.** Given a reference image and a driving video, EMOSH achieves high-fidelity, mesh-guided expressive human animation while disentangling expressive motion from body shape to prevent shape leakage.

**Figure 2.** First, the motion tracker extracts motion $(\theta^d, \psi^d)$ and camera $(C^d)$ parameters from the driving video and shape parameters $(\beta_b^r, \beta_f^r)$ from the reference image, achieving motion-shape disentanglement via EHM retargeting. The retargeted model is then rendered into hybrid control signals through semantic color shading and keypoint drawing, and encoded into motion latents. During the generation phase…

**Figure 3.** (a) The motion tracker extracts 2D/3D priors from the input video and dynamically filters unreliable guidance signals via Validity Gating, obtaining EHM parameters through joint iterative optimization. (b) For subsequent chunks in long video inference, the initial latent from the first chunk is injected as an additional spatially-aligned latent

**Figure 4.** Visual results of our motion tracker. Our tracker accurately extracts expressive motion parameters across diverse scenarios.

**Figure 5.** Qualitative results on three datasets under the self-driven setting. EMOSH generates higher-fidelity videos and achieves more accurate control of facial expressions and hand gestures compared to the baselines. Cross-driven. Tab. 2 presents the Identity Preservation Score (IPS) under the Cross-ID setting. The results show that our approach achieves the highest IPS, verifying its superior robustness in prese…

**Figure 6.** Visual quality comparison under the cross-driven setting. While achieving precise motion control, our method better preserves the identity and body shape characteristics of the reference subject, effectively avoiding the shape distortion and limb artifacts seen in other methods. … generally control expressions and gestures. However, they still suffer from incorrect mouth shapes, and blurry or merged finge…

**Figure 7.** Qualitative ablation results. (Left) Under the self-driven setting, the full method enables more accurate control over expressions and gestures. (Right) Under the cross-driven setting, the full method disentangles shape and motion, whereas removing Spatially-Aligned Conditioning (SAC) leads to artifacts and reduces identity preservation. … intricate hand gestures. Furthermore, it suffers a significant performance dr…

**Figure 1.** Long-term identity consistency with and without SAC. … over time reveals a widening performance gap between the "w/ SAC" and "w/o SAC" variants. Although varying video lengths cause minor statistical noise at the tail end, the increasing margin empirically indicates that our Spatially-Aligned Conditioning (SAC) module effectively mitigates identity consistency degradation in long-term generation. D.3 Trackin…

**Figure 2.** Visual comparison of tracking results. We compare our tracker with GUAVA across full-body, half-body, and head-only scenarios. As highlighted by the red dashed boxes, GUAVA is prone to foot/hand deviations, hallucinating occluded arms, and tracking collapse in extreme close-ups. Conversely, our method gracefully handles these complex scenes and severe occlusions, validating the effectiveness of our confide…

**Figure 3.** Zero-shot dynamic zoom trajectory. We demonstrate the model's ability to perform simple camera movements (e.g., zoom-in and zoom-out) despite lacking explicit training on dynamic trajectories. By adjusting the rendering camera's distance, the resulting scale variation of the condition mesh acts as a spatial cue, implicitly guiding the model to follow the intended camera shifts. … subject to rotate in the gen…

**Figure 4.** Impact of anchor frame selection on generation
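
The zoom caption above attributes the effect to the condition mesh changing scale as the rendering camera's distance changes. Under a simple pinhole camera assumption (the page does not say which camera model the renderer uses), that relationship can be sketched in Python as follows; `projected_height` and `zoom_distances` are hypothetical helper names.

```python
from typing import List


def projected_height(real_height: float, distance: float, focal_length: float) -> float:
    """Pinhole projection: on-image size is focal_length * size / distance."""
    if distance <= 0:
        raise ValueError("camera distance must be positive")
    return focal_length * real_height / distance


def zoom_distances(start: float, end: float, num_frames: int) -> List[float]:
    """Linearly interpolate the camera distance across the frames of a clip."""
    if num_frames < 2:
        return [start]
    step = (end - start) / (num_frames - 1)
    return [start + step * i for i in range(num_frames)]
```

Halving the camera distance doubles the rendered mesh height in this model, so a decreasing distance schedule produces a zoom-in cue and an increasing one produces a zoom-out cue.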

Original abstract

High-fidelity and expressive controllable human animation is essential for content creation and digital avatar applications. However, existing methods face a dilemma between expressiveness and disentanglement. Mainstream 2D pose-conditioned approaches suffer from "motion-shape entanglement", leading to the leakage of the driving subject's body shape. Conversely, methods relying on 3D priors (e.g., SMPL) achieve geometric disentanglement but struggle to capture facial expressions and complex gestures, resulting in rigid animations. To this end, we propose EMOSH, a novel framework for high-fidelity controllable human video generation. First, an Expressive Human Model (EHM) is introduced as the core control representation. By explicitly disentangling shape and pose parameters, we fundamentally resolve the body shape leakage issue. Alongside this, a robust motion tracker is designed to accurately estimate EHM parameters from video. Second, we propose a Coarse-to-Fine Hybrid Motion Injection strategy, enabling more fine-grained control over expressions and gestures. Furthermore, we introduce a Spatially-Aligned Conditioning mechanism to bridge the domain gap between training and inference, improving identity consistency. Extensive experiments demonstrate that EMOSH outperforms previous methods in both self-driven and cross-driven scenarios, producing high-fidelity videos with vivid expressions while maintaining shape disentanglement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

EMOSH introduces an Expressive Human Model plus two conditioning tricks to fix shape leakage in human video synthesis, but the whole thing depends on a motion tracker whose accuracy is not shown.

The main point is that this paper puts forward an Expressive Human Model (EHM) that splits shape and pose parameters explicitly, along with a Coarse-to-Fine Hybrid Motion Injection and Spatially-Aligned Conditioning, to get both disentanglement and better expression control than either pure 2D pose or SMPL-style 3D methods.

The new elements are the EHM representation itself and the two named mechanisms. The abstract correctly identifies the usual tradeoff (2D methods leak driver shape into the output, while 3D priors lose facial nuance and complex gestures) and positions the new components as a practical workaround. If the full experiments hold up, the hybrid injection and alignment step could give animators finer control without the usual identity drift.

The soft spot is the motion tracker. The claim that explicit disentanglement “fundamentally resolves” leakage only works if the tracker extracts clean EHM parameters from arbitrary video. The abstract calls the tracker robust but supplies no error rates, no leakage measurements on diverse inputs, and no ablation on what happens when estimation is imperfect. Any shape or pose noise fed into the generator would recreate the entanglement the method is meant to eliminate. The experiments are described as outperforming prior work in self-driven and cross-driven settings, yet the abstract contains no numbers, baselines, or dataset details, so the size of the improvement cannot be judged.

This is for researchers building controllable human video pipelines who already deal with leakage and expression limits. A reader who needs a new control representation might find the specific combination worth testing.

I would send it to peer review. The framing is direct and the proposed pieces address a known failure mode; the tracker and quantitative results are the parts that need referee scrutiny.

Referee Report

2 major / 0 minor

Summary. The paper proposes EMOSH, a framework for high-fidelity controllable human video generation. It introduces an Expressive Human Model (EHM) that explicitly disentangles shape and pose parameters to resolve body shape leakage from motion-shape entanglement in prior 2D approaches, while addressing the limited expressiveness of 3D-prior methods like SMPL. A robust motion tracker estimates EHM parameters from video; a Coarse-to-Fine Hybrid Motion Injection strategy provides fine-grained control over expressions and gestures; and a Spatially-Aligned Conditioning mechanism improves identity consistency. The paper claims that EMOSH outperforms prior methods in both self-driven and cross-driven scenarios, yielding high-fidelity videos with vivid expressions and maintained shape disentanglement.

Significance. If the central claims hold with supporting quantitative evidence, the explicit disentanglement in EHM could meaningfully advance controllable human animation by overcoming the expressiveness-disentanglement trade-off, with potential applications in digital avatars and content creation. The introduction of the EHM representation and associated tracker represents a targeted architectural response to a documented limitation in the field.

major comments (2)

[Abstract] Abstract (second paragraph): The claim that EHM 'fundamentally resolve[s] the body shape leakage issue' by explicit disentanglement is load-bearing for the paper's central contribution, yet it depends on the unverified assumption that the accompanying motion tracker can estimate EHM parameters from arbitrary video input without errors that would reintroduce leakage or degrade control. No quantitative validation of tracker accuracy, failure modes, or leakage metrics is supplied.
[Abstract] Abstract (final sentence): The assertion that 'extensive experiments demonstrate that EMOSH outperforms previous methods' is central to the paper's evaluation claim, but the provided text supplies no metrics, baselines, dataset details, ablation results, or quantitative comparisons, making it impossible to assess whether the disentanglement and expressiveness improvements are realized in practice.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We address each point below and agree to revise the abstract for improved clarity and precision while defending the manuscript's core claims based on the full text.

Referee: [Abstract] Abstract (second paragraph): The claim that EHM 'fundamentally resolve[s] the body shape leakage issue' by explicit disentanglement is load-bearing for the paper's central contribution, yet it depends on the unverified assumption that the accompanying motion tracker can estimate EHM parameters from arbitrary video input without errors that would reintroduce leakage or degrade control. No quantitative validation of tracker accuracy, failure modes, or leakage metrics is supplied.

Authors: The EHM achieves disentanglement explicitly through its parameter design separating shape and pose, which directly addresses motion-shape entanglement in 2D methods by construction; tracker estimation errors do not reintroduce the same leakage mechanism. The full manuscript details the robust motion tracker with quantitative validation of accuracy, failure modes, and leakage metrics in Sections 3.2 and 5. We will revise the abstract wording to 'significantly mitigates' for precision. (Revision: partial.)
Referee: [Abstract] Abstract (final sentence): The assertion that 'extensive experiments demonstrate that EMOSH outperforms previous methods' is central to the paper's evaluation claim, but the provided text supplies no metrics, baselines, dataset details, ablation results, or quantitative comparisons, making it impossible to assess whether the disentanglement and expressiveness improvements are realized in practice.

Authors: Abstracts provide high-level overviews; the manuscript's Section 5 supplies the requested details including metrics, baselines, datasets, and ablations demonstrating outperformance in self- and cross-driven scenarios. To address the concern, we will revise the abstract's final sentence to reference key quantitative outcomes. (Revision: yes.)

Circularity Check

0 steps flagged

No significant circularity; claims rest on introduction of new disentangled model components

The paper's central claims center on the introduction of the Expressive Human Model (EHM) with explicit shape-pose separation and an accompanying motion tracker, plus hybrid injection and conditioning mechanisms. No equations, fitted parameters, or self-citation chains are presented in the provided text that reduce any prediction or uniqueness result to its own inputs by construction. The resolution of shape leakage is asserted as a direct consequence of the explicit parameterization in the new model rather than a derived quantity that loops back to prior fitted values or self-referential definitions. This is the most common honest finding for a methods paper whose contributions are architectural rather than re-derivations of existing results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The abstract introduces the Expressive Human Model as a new control representation whose disentanglement property is asserted without upstream derivation; no free parameters, standard axioms, or additional invented entities are described.

invented entities (1)

Expressive Human Model (EHM): no independent evidence
purpose: Core control representation that explicitly disentangles shape and pose parameters to eliminate body shape leakage
Presented as the central new construct in the abstract; no independent evidence or prior citation supplied.

Reference graph

Works this paper leans on

91 extracted references · 31 canonical work pages · 16 internal anchors

[1]

OpenAI Blog1(8), 1 (2024) 3

Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., et al.: Video generation models as world simulators. OpenAI Blog1(8), 1 (2024) 3

2024
[2]

In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers

Cao, C., Zhou, J., Li, S., Liang, J., Yu, C., Wang, F., Xue, X., Fu, Y.: Uni3c: Unifying precisely 3d-enhanced camera and human motion controls for video gen- eration. In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers. pp. 1–12 (2025) 4

2025
[3]

IEEE transactions on pattern analysis and machine intelligence43(1), 172–186 (2019) 4

Cao, Z., Hidalgo, G., Simon, T., Wei, S.E., Sheikh, Y.: Openpose: Realtime multi- person 2d pose estimation using part affinity fields. IEEE transactions on pattern analysis and machine intelligence43(1), 172–186 (2019) 4

2019
[4]

arXiv preprint arXiv:2311.12052 (2023) 2

Chang, D., Shi, Y., Gao, Q., Fu, J., Xu, H., Song, G., Yan, Q., Zhu, Y., Yang, X., Soleymani,M.:Magicpose:Realistichumanposesandfacialexpressionsretargeting with identity-aware diffusion. arXiv preprint arXiv:2311.12052 (2023) 2

work page arXiv 2023
[5]

arXiv preprint arXiv:2509.14055 (2025) 4, 9, 21

Cheng, G., Gao, X., Hu, L., Hu, S., Huang, M., Ji, C., Li, J., Meng, D., Qi, J., Qiao, P., et al.: Wan-animate: Unified character animation and replacement with holistic replication. arXiv preprint arXiv:2509.14055 (2025) 4, 9, 21

work page arXiv 2025
[6]

arXiv preprint arXiv:1907.06571 (2019) 3

Clark, A., Donahue, J., Simonyan, K.: Adversarial video generation on complex datasets. arXiv preprint arXiv:1907.06571 (2019) 3

work page arXiv 1907
[7]

google / blog / genie - 3 - a - new - frontier - for - world - models/, ac- cessed: 2026-02-15 4

DeepMind, G.: Genie 3: A new frontier for world models (2025),https:// deepmind . google / blog / genie - 3 - a - new - frontier - for - world - models/, ac- cessed: 2026-02-15 4

2025
[8]

DeepMind, G.: Veo 3 (2025),https://deepmind.google/technologies/veo/, ac- cessed: 2026-02-15 4

2025
[9]

In: CVPR (2019) 10

Deng, J., Guo, J., Niannan, X., Zafeiriou, S.: Arcface: Additive angular margin loss for deep face recognition. In: CVPR (2019) 10

2019
[10]

Advances in neural information processing systems31(2018) 4

Dong, H., Liang, X., Gong, K., Lai, H., Zhu, J., Yin, J.: Soft-gated warping-gan for pose-guided person image synthesis. Advances in neural information processing systems31(2018) 4

2018
[11]

Seedance 1.0: Exploring the Boundaries of Video Generation Models

Gao, Y., Guo, H., Hoang, T., Huang, W., Jiang, L., Kong, F., Li, H., Li, J., Li, L., Li, X., et al.: Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113 (2025) 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

Communications of the ACM63(11), 139–144 (2020) 2

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM63(11), 139–144 (2020) 2

2020
[13]

Imagen Video: High Definition Video Generation with Diffusion Models

Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D.P., Poole, B., Norouzi, M., Fleet, D.J., et al.: Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303 (2022) 3

work page internal anchor Pith review Pith/arXiv arXiv 2022
[14]

Advances in neural information processing systems33, 6840–6851 (2020) 2, 3

Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems33, 6840–6851 (2020) 2, 3

2020
[15]

Advances in neural information processing systems35, 8633– 8646 (2022) 3

Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., Fleet, D.J.: Video diffusion models. Advances in neural information processing systems35, 8633– 8646 (2022) 3

2022
[16]

In: Proceedings of the IEEE/CVF confer- ence on computer vision and pattern recognition

Hong, F.T., Zhang, L., Shen, L., Xu, D.: Depth-aware generative adversarial net- work for talking head video generation. In: Proceedings of the IEEE/CVF confer- ence on computer vision and pattern recognition. pp. 3397–3406 (2022) 4 16 D. Zhang et al

2022
[17]

Iclr1(2), 3 (2022) 22

Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. Iclr1(2), 3 (2022) 22

2022
[18]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Hu, L.: Animate anyone: Consistent and controllable image-to-video synthesis for character animation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8153–8163 (2024) 2, 4

2024
[19]

ACM Transactions on Graphics (TOG)44(6), 1–15 (2025) 4

Huang, T., Zheng, W., Wang, T., Liu, Y., Wang, Z., Wu, J., Jiang, J., Li, H., Lau, R., Zuo, W., et al.: Voyager: Long-range and world-consistent video diffusion for explorable 3d scene generation. ACM Transactions on Graphics (TOG)44(6), 1–15 (2025) 4

2025
[20]

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

Huang, X., Li, Z., He, G., Zhou, M., Shechtman, E.: Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009 (2025) 4

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Jafarian, Y., Park, H.S.: Learning high fidelity depths of dressed humans by watch- ing social media dance videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12753–12762 (2021) 10

2021
[22]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Jiang, Z., Han, Z., Mao, C., Zhang, J., Pan, Y., Liu, Y.: Vace: All-in-one video creation and editing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17191–17202 (2025) 2

2025
[23]

ACM Trans

Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G., et al.: 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph.42(4), 139–1 (2023) 4

2023
[24]

Adam: A Method for Stochastic Optimization

Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014) 9

work page internal anchor Pith review Pith/arXiv arXiv 2014
[25]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al.: Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603 (2024) 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

arXiv preprint arXiv:2506.172012(3), 6 (2025) 4

Li, J., Tang, J., Xu, Z., Wu, L., Zhou, Y., Shao, S., Yu, T., Cao, Z., Lu, Q.: Hunyuan-gamecraft: High-dynamic interactive game video generation with hybrid history condition. arXiv preprint arXiv:2506.172012(3), 6 (2025) 4

work page arXiv 2025
[27]

ACM Transactions on Graphics, (Proc

Li, T., Bolkart, T., Black, M.J., Li, H., Romero, J.: Learning a model of facial shape and expression from 4D scans. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia)36(6), 194:1–194:17 (2017),https://doi.org/10.1145/3130800.31308136, 23

work page doi:10.1145/3130800.31308136 2017
[28]

Flow Matching for Generative Modeling

Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022) 9, 21

work page internal anchor Pith review Pith/arXiv arXiv 2022
[29]

IEEE Transactions on Pattern Analysis and Machine Intelligence44(9), 5115–5133 (2021) 4

Liu, W., Piao, Z., Tu, Z., Luo, W., Ma, L., Gao, S.: Liquid warping gan with attention: A unified framework for human image synthesis. IEEE Transactions on Pattern Analysis and Machine Intelligence44(9), 5115–5133 (2021) 4

2021
[30]

arXiv preprint arXiv:2502.10982 (2025) 9, 24

Liu, Y., Zhu, L., Lin, L., Zhu, Y., Zhang, A., Li, Y.: Teaser: Token enhanced spatial modeling for expressions reconstruction. arXiv preprint arXiv:2502.10982 (2025) 9, 24

work page arXiv 2025
[31]

Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: A skinnedmulti-personlinearmodel.ACMTrans.Graphics(Proc.SIGGRAPHAsia) 34(6), 248:1–248:16 (Oct 2015) 2, 4, 6

2015
[32]

Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation

Low, C., Wang, W., Katyal, C.: Ovi: Twin backbone cross-modal fusion for audio- video generation. arXiv preprint arXiv:2510.01284 (2025) 4

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

MediaPipe: A Framework for Building Perception Pipelines

Lugaresi, C., Tang, J., Nash, H., McClanahan, C., Uboweja, E., Hays, M., Zhang, F., Chang, C.L., Yong, M.G., Lee, J., et al.: Mediapipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172 (2019) 9, 23

work page internal anchor Pith review Pith/arXiv arXiv 1906
[34]

In: Proceedings of EMOSH 17 the IEEE/CVF International Conference on Computer Vision

Luo, Y., Rong, Z., Wang, L., Zhang, L., Hu, T.: Dreamactor-m1: Holistic, expres- sive and robust human image animation with hybrid guidance. In: Proceedings of EMOSH 17 the IEEE/CVF International Conference on Computer Vision. pp. 11036–11046 (2025) 2

2025
[35]

Advances in neural information processing systems30 (2017) 4

Ma, L., Jia, X., Sun, Q., Schiele, B., Tuytelaars, T., Van Gool, L.: Pose guided person image generation. Advances in neural information processing systems30 (2017) 4

2017
[36]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Ma, Y., He, Y., Cun, X., Wang, X., Chen, S., Li, X., Chen, Q.: Follow your pose: Pose-guided text-to-video generation using pose-free videos. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 4117–4125 (2024) 4

2024
[37]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Meng, R., Zhang, X., Li, Y., Ma, C.: Echomimicv2: Towards striking, simplified, and semi-body human animation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 5489–5498 (2025) 10

2025
[38]

Commu- nications of the ACM65(1), 99–106 (2021) 4

Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. Commu- nications of the ACM65(1), 99–106 (2021) 4

2021
[39]

com / MooreThreads / Moore-AnimateAnyone(2024), accessed: 2026-02-15 9

Moore Threads: Moore-animateanyone.https : / / github . com / MooreThreads / Moore-AnimateAnyone(2024), accessed: 2026-02-15 9

2024
[40]

Advances in neural information processing sys- tems32(2019) 9, 26

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: Pytorch: An imperative style, high- performance deep learning library. Advances in neural information processing sys- tems32(2019) 9, 26

2019
[41]

In: Proceedings IEEE Conf

Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A.A., Tzionas, D., Black, M.J.: Expressive body capture: 3D hands, face, and body from a single image. In: Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). pp. 10975–10985 (2019) 6, 23

2019
[42]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Pavlakos, G., Shan, D., Radosavovic, I., Kanazawa, A., Fouhey, D., Malik, J.: Reconstructing hands in 3d with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9826–9836 (2024) 9, 24

2024
[43]

Peebles,W.,Xie,S.:Scalablediffusionmodelswithtransformers.In:Proceedingsof the IEEE/CVF international conference on computer vision. pp. 4195–4205 (2023) 3, 5, 9, 22

2023
[44]

Qian, S., Kirschstein, T., Schoneveld, L., Davoli, D., Giebenhain, S., Nießner, M.: Gaussianavatars:Photorealisticheadavatarswithrigged3dgaussians.In:Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20299–20309 (2024) 4

2024
[45]

Geometry-Contrastive GAN for Facial Expression Transfer

Qiao, F., Yao, N., Jiao, Z., Li, Z., Chen, H., Wang, H.: Geometry-contrastive gan for facial expression transfer. arXiv preprint arXiv:1802.01822 (2018) 4

work page internal anchor Pith review Pith/arXiv arXiv 2018
[46]

arXiv preprint arXiv:2512.21338 (2025) 4, 8

Qiu, H., Liu, S., Zhou, Z., An, Z., Ren, W., Liu, Z., Schult, J., He, S., Chen, S., Cong, Y., et al.: Histream: Efficient high-resolution video generation via redundancy-eliminated streaming. arXiv preprint arXiv:2512.21338 (2025) 4, 8

work page arXiv 2025
[47]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Qiu, L., Gu, X., Li, P., Zuo, Q., Shen, W., Zhang, J., Qiu, K., Yuan, W., Chen, G., Dong, Z., et al.: Lhm: Large animatable human reconstruction model for single image to 3d in seconds. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14184–14194 (2025) 4

2025
[48]

In: International conference on machine learning

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021) 5

2021
[49]

Journal of machine learning research21(140), 1–67 (2020) 5 18 D

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research21(140), 1–67 (2020) 5 18 D. Zhang et al

2020
[50]

Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., Sutskever, I.: Zero-shot text-to-image generation. In: International conference on machine learning. pp. 8821–8831. PMLR (2021)

[51]

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)

[52]

Runway: Introducing runway gen-4.5: A new frontier for video generation. https://runwayml.com/research/introducing-runway-gen-4.5 (2025), accessed: 2026-02-15

[53]

Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems 35, 36479–36494 (2022)

[54]

Shao, R., Pang, Y., Zheng, Z., Sun, J., Liu, Y.: 360-degree human video generation with 4d diffusion transformer. ACM Transactions on Graphics (TOG) 43(6), 1–13 (2024)

[55]

Shen, L., Qiao, Q., Yu, T., Zhou, K., Yu, T., Zhan, Y., Wang, Z., Tao, M., Yin, S., Liu, S.: Soulx-flashtalk: Real-time infinite streaming of audio-driven avatars via self-correcting bidirectional distillation (2025), https://arxiv.org/abs/2512.23379

[56]

Shen, Z., Pi, H., Xia, Y., Cen, Z., Peng, S., Hu, Z., Bao, H., Hu, R., Zhou, X.: World-grounded human motion recovery via gravity-view coordinates. In: SIGGRAPH Asia 2024 Conference Papers. pp. 1–11 (2024)

[57]

Siarohin, A., Lathuilière, S., Tulyakov, S., Ricci, E., Sebe, N.: First order motion model for image animation. Advances in neural information processing systems 32 (2019)

[58]

Siarohin, A., Sangineto, E., Lathuiliere, S., Sebe, N.: Deformable gans for pose-based human image generation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3408–3416 (2018)

[59]

Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., et al.: Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792 (2022)

[60]

Song, G., Xu, H., Zhao, X., Xie, Y., Gu, T., Li, Z., Zhang, C., Luo, L.: X-unimotion: Animating human images with expressive, unified and identity-agnostic motion latents. In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers. pp. 1–11 (2025)

[61]

Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)

[62]

Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020)

[63]

Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., Liu, Y.: Roformer: Enhanced transformer with rotary position embedding. Neurocomputing 568, 127063 (2024)

[64]

Sun, Y.T., Fu, Q.C., Jiang, Y.R., Liu, Z., Lai, Y.K., Fu, H., Gao, L.: Human motion transfer with 3d constraints and detail enhancement. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(4), 4682–4693 (2022)

[65]

Tu, S., Xing, Z., Han, X., Cheng, Z.Q., Dai, Q., Luo, C., Wu, Z.: Stableanimator: High-quality identity-preserving human image animation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 21096–21106 (2025)

[66]

Tulyakov, S., Liu, M.Y., Yang, X., Kautz, J.: Mocogan: Decomposing motion and content for video generation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1526–1535 (2018)

[67]

Villegas, R., Babaeizadeh, M., Kindermans, P.J., Moraldo, H., Zhang, H., Saffar, M.T., Castro, S., Kunze, J., Erhan, D.: Phenaki: Variable length video generation from open domain textual description. arXiv preprint arXiv:2210.02399 (2022)

[68]

Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)

[69]

Wang, T., Li, L., Lin, K., Zhai, Y., Lin, C.C., Yang, Z., Zhang, H., Liu, Z., Wang, L.: Disco: Disentangled control for realistic human dance generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9326–9336 (2024)

[70]

Wang, T.C., Mallya, A., Liu, M.Y.: One-shot free-view neural talking-head synthesis for video conferencing. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10039–10049 (2021)

[71]

Wang, X., Zhang, S., Gao, C., Wang, J., Zhou, X., Zhang, Y., Yan, L., Sang, N.: Unianimate: Taming unified video diffusion models for consistent human image animation. Science China Information Sciences 68(10), 200103 (2025)

[72]

Wei, D., Xu, X., Shen, H., Huang, K.: Gac-gan: A general method for appearance-controllable human video motion transfer. IEEE Transactions on Multimedia 23, 2457–2470 (2020)

[73]

Xie, L., Wang, X., Zhang, H., Dong, C., Shan, Y.: Vfhq: A high-quality dataset and benchmark for video face super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 657–666 (2022)

[74]

Xu, S., Zheng, S., Wang, Z., Yu, H., Chen, J., Zhang, H., Li, B., Jiang, P.T.: Hypermotion: Dit-based pose-guided human image animation of complex motions. arXiv preprint arXiv:2505.22977 (2025)

[75]

Xu, Z., Zhang, J., Liew, J.H., Yan, H., Liu, J.W., Zhang, C., Feng, J., Shou, M.Z.: Magicanimate: Temporally consistent human image animation using diffusion model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1481–1490 (2024)

[76]

Yan, W., Zhang, Y., Abbeel, P., Srinivas, A.: Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157 (2021)

[77]

Yang, Z., Zeng, A., Yuan, C., Li, Y.: Effective whole-body pose estimation with two-stages distillation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4210–4220 (2023)

[78]

Yi, J., Jang, W., Cho, P.H., Nam, J., Yoon, H., Kim, S.: Deep forcing: Training-free long video generation with deep sink and participative compression. arXiv preprint arXiv:2512.05081 (2025)

[79]

Zhang, D., Liu, Y., Lin, L., Zhu, Y., Chen, K., Qin, M., Li, Y., Wang, H.: Hravatar: High-quality and relightable gaussian head avatar. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 26285–26296 (2025)

[80]

Zhang, D., Liu, Y., Lin, L., Zhu, Y., Li, Y., Qin, M., Li, Y., Wang, H.: Guava: Generalizable upper body 3d gaussian avatar. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14205–14217 (2025)
