pith. machine review for the scientific record.

arxiv: 2604.19720 · v1 · submitted 2026-04-21 · 💻 cs.CV

Recognition: unknown

ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 02:22 UTC · model grok-4.3

classification 💻 cs.CV
keywords: human video generation · controllable synthesis · image-first approach · SMPL-X guidance · video diffusion refinement · pose and viewpoint control · temporal consistency

The pith

High-quality, controllable human videos are generated by first creating appearance with an image model, then applying SMPL-X motion guidance and training-free temporal refinement, without any retraining on video data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that human video generation struggles when appearance, motion, and viewpoint are modeled jointly under scarce multi-view data. Instead, it decouples the problem by first training a high-quality image generator on human appearance, then guiding that output with SMPL-X pose and viewpoint controls, and finally applying a training-free refinement step drawn from a pretrained video diffusion model to enforce temporal consistency. This image-first pipeline aims to deliver videos that maintain visual fidelity while allowing diverse pose and camera control. The authors support the approach with a new canonical human dataset and an auxiliary compositional image model.
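Read as pseudocode, that decoupling is just three frozen components composed at inference time. The sketch below is a hedged illustration of the data flow, not the released implementation: `ImageBackbone`, `VideoDiffusion`, `reimagine_style_pipeline`, and the `noise_level` value are all hypothetical placeholders.

```python
import numpy as np

# Hypothetical stand-ins for the paper's three pretrained components; the
# names, signatures, and noise_level are illustrative, not the released API.

class ImageBackbone:
    """Pretrained image diffusion model carrying the appearance prior."""
    def synthesize(self, appearance_ref, smplx_pose, viewpoint):
        # Real model: denoise a latent conditioned on pose/viewpoint tokens.
        # Stub: return a fake H x W x 3 frame so the sketch runs end to end.
        return np.zeros((256, 256, 3)) + 0.0 * appearance_ref.mean()

class VideoDiffusion:
    """Off-the-shelf pretrained video diffusion model, used frozen."""
    def redenoise(self, frames, noise_level):
        # Real model: lightly noise the clip and denoise it jointly, smoothing
        # frame-to-frame inconsistencies without optimizing any parameters.
        return frames  # stub: identity

def reimagine_style_pipeline(appearance_ref, smplx_poses, viewpoints,
                             image_model, video_model, noise_level=0.2):
    # Stages 1-2: per-frame synthesis; the appearance prior is guided by
    # SMPL-X pose and camera viewpoint, with no joint video training.
    frames = [image_model.synthesize(appearance_ref, p, v)
              for p, v in zip(smplx_poses, viewpoints)]
    # Stage 3: training-free temporal refinement through the frozen video model.
    return video_model.redenoise(frames, noise_level=noise_level)

video = reimagine_style_pipeline(np.zeros((256, 256, 3)),
                                 smplx_poses=[None] * 16,
                                 viewpoints=[None] * 16,
                                 image_model=ImageBackbone(),
                                 video_model=VideoDiffusion())
```

The modularity the pith describes lives entirely in the last function: swapping `image_model` changes identity and appearance while the motion and refinement stages are untouched.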

Core claim

By treating high-quality human appearance as a prior learned through image generation and then layering SMPL-X-based motion guidance plus training-free temporal refinement from a video diffusion model, the method produces temporally consistent, high-quality videos under varied poses and viewpoints without requiring joint end-to-end training on video data.

What carries the argument

A pretrained image backbone that supplies appearance priors, combined with SMPL-X parametric body guidance for pose and viewpoint control, followed by a training-free temporal refinement stage that uses a separate pretrained video diffusion model.

If this is right

  • Videos can be produced with independent control over identity appearance, body pose sequence, and camera trajectory.
  • No video-specific fine-tuning is needed once the image backbone and refinement model are pretrained.
  • New human identities can be introduced by swapping the image-generation stage while keeping the same motion and refinement pipeline.
  • The released canonical dataset enables direct comparison of appearance priors across methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same image-first separation could extend to other domains such as animal or object video generation where multi-view video data is scarce.
  • If the refinement stage can be made conditional on additional signals, finer control over lighting or clothing dynamics might become possible without retraining.
  • Compositional image models released with the paper could allow mixing body parts or outfits at the appearance stage before motion is applied.

Load-bearing premise

High-quality appearance learned only from still images can transfer directly to video synthesis when guided by SMPL-X and refined with an off-the-shelf video model, without any joint training on video data.

What would settle it

Generate videos of the same person under extreme novel viewpoints or rapid pose transitions; if visible artifacts, identity drift, or temporal flickering appear at higher rates than in competing joint-training methods, the image-first prior fails to carry over effectively.
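One hedged way to operationalize that test, under assumptions not made in the paper: measure identity drift as the mean cosine distance between each frame's identity embedding and the first frame's (the `embed_identity` argument below is a hypothetical stand-in for a face or person re-ID encoder), and measure flicker as the mean absolute difference between consecutive frames (a crude proxy that also penalizes genuine motion; a flow-compensated variant would be stricter).

```python
import numpy as np

def identity_drift(frames, embed_identity):
    """Mean cosine distance of each frame's identity embedding from frame 0.
    `embed_identity` is a hypothetical per-frame encoder (face or re-ID)."""
    ref = embed_identity(frames[0])
    ref = ref / np.linalg.norm(ref)
    drifts = []
    for f in frames[1:]:
        e = embed_identity(f)
        drifts.append(1.0 - float(e @ ref) / np.linalg.norm(e))
    return float(np.mean(drifts))

def flicker(frames):
    """Crude temporal-flicker proxy: mean absolute difference between
    consecutive frames (ignores genuine motion)."""
    diffs = [np.abs(a.astype(np.float64) - b.astype(np.float64)).mean()
             for a, b in zip(frames[:-1], frames[1:])]
    return float(np.mean(diffs))
```

Scoring both the image-first pipeline and jointly trained baselines on the same SMPL-X-driven clips would make the premise directly falsifiable.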

Figures

Figures reproduced from arXiv: 2604.19720 by Chenghong Li, Heyuan Li, Hongjie Liao, Keru Zheng, Shuguang Cui, Shuliang Ning, Xiaoguang Han, Xihe Yang, Yihao Zhi, Zhengwentai Sun.

Figure 1
Figure 1. Our method enables controllable human synthesis at multiple levels. (a) Our pipeline generates temporally coherent videos with explicit control over body pose and camera viewpoint. (b) Our image model generalizes to in-the-wild references, producing diverse poses and viewpoints with consistent appearance. (c) As an additional contribution, our end-to-end model supports compositional synthesis with disentangled…

Figure 2
Figure 2. Overview of our image-first training and inference paradigm. (a) During training, a powerful pretrained image backbone is fine-tuned via lightweight LoRA adaptation using an imperfect multi-view dataset with structured pose and viewpoint supervision. (b) At inference, the fine-tuned model generalizes to high-quality canonical inputs and enables pose- and viewpoint-controllable human synthesis.

Figure 3
Figure 3. Overview of the proposed pose- and view-guided generation module. SMPL-X–based pose and canonical front/back appearance cues are unified in a token sequence and processed by a DiT backbone with condition-aware RoPE. The final image is obtained by decoding the generated latent using a VAE decoder (not shown for clarity).

Figure 4
Figure 4. Training-free temporal consistency via low-noise re-denoising and spatiotemporal spectral regularization. Boxes highlight regions with improved temporal consistency. The adjacent method text applies RoPE to queries and keys (the same transformation applied to \mathbf{k}_j) and computes attention as
\mathrm{Attn}(\mathbf{q}_i, \mathbf{k}_j) = \frac{\mathrm{RoPE}(\mathbf{q}_i; \mathbf{p}_i)^\top \mathrm{RoPE}(\mathbf{k}_j; \mathbf{p}_j)}{\sqrt{d}}. \quad (7)
A minimal code sketch of this attention follows the figure list.

Figure 5
Figure 5. Qualitative comparison for image-to-video human synthesis on the MVHumanNet++ dataset [25]. We compare our method with Wan-Fun [1], Wan-Animate (Wan-Ani) [3], Qwen [47], and Human4DiT [39]. The ground truth (GT) is shown in the first column.

Figure 6
Figure 6. Qualitative comparison on the DNA-Rendering dataset [6]. Our method is evaluated in a zero-shot setting without training on this dataset, demonstrating strong generalization under more challenging viewpoints. The adjacent text notes that the pipeline runs in a training-free manner with no parameters optimized, using 20 diffusion inference steps for both the image synthesis module and the training-free temporal consistency module.

Figure 7
Figure 7. Temporal consistency ablation via tracking visualization. The adjacent text reports substantially better temporal consistency for our method, with an FVD of 0.275 on MVHumanNet versus 0.403 for Wan-Animate; although Qwen-Image-Edit attains a relatively high SSIM (0.831), that metric is misleading for video generation because it rewards high-quality individual frames that lack temporal coherence.

Figure 8
Figure 8. Ablation on missing back-view appearance input. The two leftmost columns show the input appearance images, where the back-view input is replaced by a blank image in row (a).

Figure 9
Figure 9. Qualitative comparison between our image-first method and a video-first baseline (Uni-Animate DiT). The leftmost column shows the canonical front reference input (back reference omitted for space). Each row corresponds to the same subject under different poses. Red dashed boxes highlight results generated by our image-first pipeline, while the remaining images are produced by the video-first baseline.

Figure 10
Figure 10. Data construction pipeline for building canonical and disentangled human assets from MVHumanNet [48].

Figure 11
Figure 11. More results produced by our end-to-end method. The adjacent discussion argues that the refinement stage only needs to enforce motion coherence across independently generated frames, rather than learning full video synthesis from limited video data, supporting the formulation of controllable human video generation as a motion-guided image synthesis problem.

Figure 12
Figure 12. Qualitative results on in-the-wild appearance inputs. Given a single reference image and SMPL-X pose sequences, our model generates consistent human images under diverse poses and viewpoints while preserving identity and clothing details.

Figure 13
Figure 13. Qualitative results under large viewpoint changes. Starting from a canonical appearance input, the model synthesizes consistent human images as the camera viewpoint rotates around the subject, demonstrating strong control over viewpoint and pose.

Figure 14
Figure 14. Examples of additional applications enabled by our canonical asset-based training. The model can synthesize humans under diverse poses, combine identity and clothing assets for creative generation, and control pose under arbitrary viewpoints. The adjacent text notes the model robustly handles in-the-wild appearance inputs despite the variability of real-world imagery.
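The condition-aware RoPE attention quoted in the Figure 4 card (Eq. 7) amounts to rotating queries and keys by position-dependent angles before a scaled dot product. Below is a minimal single-pair sketch assuming the common split-half RoPE layout; the paper's actual token layout and its position encoding over pose and viewpoint conditions may differ, and `rope` and `attn_logit` are illustrative names, not the authors' code.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate an even-dimensional vector x by angles pos * base^(-k/half),
    pairing dimension k with dimension k + half (one common RoPE layout)."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    cos, sin = np.cos(pos * freqs), np.sin(pos * freqs)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def attn_logit(q, k, pos_q, pos_k):
    """Eq. (7): RoPE(q_i; p_i)^T RoPE(k_j; p_j) / sqrt(d)."""
    d = q.shape[-1]
    return float(rope(q, pos_q) @ rope(k, pos_k)) / np.sqrt(d)
```

Because the same rotation is applied to queries and keys, the logit depends on the two positions only through their offset, which is the relative-position property RoPE is used for.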
read the original abstract

Human video generation remains challenging due to the difficulty of jointly modeling human appearance, motion, and camera viewpoint under limited multi-view data. Existing methods often address these factors separately, resulting in limited controllability or reduced visual quality. We revisit this problem from an image-first perspective, where high-quality human appearance is learned via image generation and used as a prior for video synthesis, decoupling appearance modeling from temporal consistency. We propose a pose- and viewpoint-controllable pipeline that combines a pretrained image backbone with SMPL-X-based motion guidance, together with a training-free temporal refinement stage based on a pretrained video diffusion model. Our method produces high-quality, temporally consistent videos under diverse poses and viewpoints. We also release a canonical human dataset and an auxiliary model for compositional human image synthesis. Code and data are publicly available at https://github.com/Taited/ReImagine.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to introduce ReImagine, an image-first pipeline for pose- and viewpoint-controllable high-quality human video generation. High-quality appearance is learned via a pretrained image backbone and used as a prior; motion and viewpoint are controlled via SMPL-X guidance; temporal consistency is achieved through a training-free refinement stage that employs a pretrained video diffusion model. The authors also release a canonical human dataset and an auxiliary model for compositional human image synthesis, with code and data made publicly available.

Significance. If the central claims hold, the work would be significant for its modular decoupling of appearance modeling (via image priors) from motion and temporal consistency, potentially reducing reliance on scarce multi-view video data. The public release of the dataset, auxiliary model, and code is a clear strength that supports reproducibility and downstream research in controllable human video synthesis.

major comments (2)
  1. [§3] §3 (Method), temporal refinement stage: the claim that a training-free video diffusion model can reliably resolve frame-to-frame appearance/lighting mismatches induced by SMPL-X pose and viewpoint changes (without joint training or fine-tuning) is load-bearing for the temporal-consistency guarantee, yet the manuscript provides no ablations or failure-case analysis on distribution shift between the image-backbone outputs and the diffusion model's training distribution.
  2. [§4] §4 (Experiments): quantitative support for the claim of 'high-quality, temporally consistent videos under diverse poses and viewpoints' is not fully detailed in the provided description; without reported metrics (e.g., FVD, temporal consistency scores) and direct comparisons against jointly-trained baselines on the same SMPL-X-driven test cases, the superiority of the image-first + training-free approach cannot be assessed.
minor comments (2)
  1. The abstract and method overview would benefit from a concise diagram or pseudocode summarizing the three-stage pipeline (image backbone → SMPL-X guidance → training-free refinement) to clarify data flow and conditioning.
  2. Notation for SMPL-X parameters (pose, shape, viewpoint) should be explicitly defined in the first use within the method section to avoid ambiguity for readers unfamiliar with the exact parameterization.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of our work's significance, the recognition of our public releases, and the constructive feedback. We address each major comment below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§3] §3 (Method), temporal refinement stage: the claim that a training-free video diffusion model can reliably resolve frame-to-frame appearance/lighting mismatches induced by SMPL-X pose and viewpoint changes (without joint training or fine-tuning) is load-bearing for the temporal-consistency guarantee, yet the manuscript provides no ablations or failure-case analysis on distribution shift between the image-backbone outputs and the diffusion model's training distribution.

    Authors: We agree that additional ablations and failure-case analysis for the training-free temporal refinement stage would better substantiate the claims. In the revised manuscript, we will add experiments ablating the refinement stage (with/without it) across diverse SMPL-X pose and viewpoint changes, including quantitative measures of mismatch resolution and qualitative discussion of remaining failure cases due to distribution shifts between the image backbone outputs and the video diffusion model's training data. revision: yes

  2. Referee: [§4] §4 (Experiments): quantitative support for the claim of 'high-quality, temporally consistent videos under diverse poses and viewpoints' is not fully detailed in the provided description; without reported metrics (e.g., FVD, temporal consistency scores) and direct comparisons against jointly-trained baselines on the same SMPL-X-driven test cases, the superiority of the image-first + training-free approach cannot be assessed.

    Authors: We acknowledge that expanded quantitative evaluation would allow better assessment of the approach. The revised experiments section will report standard metrics such as Fréchet Video Distance (FVD) and temporal consistency scores. We will also add direct comparisons to relevant jointly-trained baselines on the same SMPL-X-driven test cases, using our publicly released canonical human dataset and code to support reproducibility. revision: yes
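For context on the promised metric, FVD fits Gaussians to video-level features, conventionally extracted with a pretrained I3D action-recognition network, and reports the Fréchet distance between the two fits. A minimal sketch of that distance, assuming feature extraction happens upstream:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussian fits of two feature sets
    (rows = videos, cols = feature dims). For FVD, features typically come
    from a pretrained I3D network; extraction is assumed done upstream."""
    mu_r, mu_g = feats_real.mean(0), feats_gen.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```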

Circularity Check

0 steps flagged

No significant circularity; pipeline assembles independent pretrained components

full rationale

The paper presents a pipeline that decouples appearance modeling (via a pretrained image backbone) from motion (via SMPL-X guidance) and applies a separate training-free refinement stage using an off-the-shelf pretrained video diffusion model. No equations, fitted parameters, or self-citations are shown that would reduce any claimed prediction or result to the inputs by construction. The central claims rest on the empirical combination of existing models rather than on any self-definitional loop, uniqueness theorem imported from the authors' prior work, or renaming of known results. This is the standard case of an engineering synthesis paper whose derivation chain remains externally grounded.

Axiom & Free-Parameter Ledger

0 free parameters · 3 axioms · 0 invented entities

The central claim rests on assumptions about the compatibility and effectiveness of pretrained diffusion models and the SMPL-X body model for decoupling appearance and motion; no new entities are invented.

axioms (3)
  • domain assumption Pretrained image diffusion models provide high-quality human appearance priors suitable for video synthesis.
    Core to the image-first decoupling strategy described in the abstract.
  • domain assumption SMPL-X-based guidance is sufficient to control diverse poses and viewpoints in the generated video.
    Invoked for the controllable pipeline stage.
  • domain assumption A training-free refinement stage using a pretrained video diffusion model can enforce temporal consistency.
    Key premise for the final stage without additional training.

pith-pipeline@v0.9.0 · 5478 in / 1538 out tokens · 53010 ms · 2026-05-10T02:22:15.304373+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

59 extracted references · 24 canonical work pages · 12 internal anchors

  1. [1] Alibaba-PAI: Wan2.1-Fun-V1.1-14B-Control. https://huggingface.co/alibaba-pai/Wan2.1-Fun-V1.1-14B-Control (2025). Hugging Face model, accessed 2025-01-21. Apache-2.0 license.

  2. [2] Chen, L., Ma, T., Liu, J., Li, B., Chen, Z., Liu, L., He, X., Li, G., He, Q., Wu, Z.: HuMo: Human-centric video generation via collaborative multi-modal conditioning. arXiv preprint arXiv:2509.08519 (2025)

  3. [4] Cheng, G., Gao, X., Hu, L., Hu, S., Huang, M., Ji, C., Li, J., Meng, D., Qi, J., Qiao, P., et al.: Wan-Animate: Unified character animation and replacement with holistic replication. arXiv preprint arXiv:2509.14055 (2025)

  4. [5] Cheng, T., Song, L., Ge, Y., Liu, W., Wang, X., Shan, Y.: YOLO-World: Real-time open-vocabulary object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16901–16911 (2024)

  5. [6] Cheng, W., Chen, R., Fan, S., Yin, W., Chen, K., Cai, Z., Wang, J., Gao, Y., Yu, Z., Lin, Z., Ren, D., Yang, L., Liu, Z., Loy, C.C., Qian, C., Wu, W., Lin, D., Dai, B., Lin, K.Y.: DNA-Rendering: A diverse neural actor repository for high-fidelity human-centric rendering. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2023)

  6. [7] Choi, Y., Kwak, S., Lee, K., Choi, H., Shin, J.: Improving diffusion models for authentic virtual try-on in the wild. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. pp. 206–235. Springer Nature Switzerland, Cham (2025)

  7. [8] Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first International Conference on Machine Learning (2024)

  8. [9] Fu, J., Li, S., Jiang, Y., Lin, K.Y., Qian, C., Loy, C.C., Wu, W., Liu, Z.: StyleGAN-Human: A data-centric odyssey of human generation. In: European Conference on Computer Vision. pp. 1–19. Springer (2022)

  9. [10] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020)

  10. [11] Hong, W., Ding, M., Zheng, W., Liu, X., Tang, J.: CogVideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868 (2022)

  11. [12] Hu, L.: Animate Anyone: Consistent and controllable image-to-video synthesis for character animation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8153–8163 (2024)

  12. [13] Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., Wang, Y., Chen, X., Wang, L., Lin, D., Qiao, Y., Liu, Z.: VBench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)

  13. [14] Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al.: GPT-4o system card. arXiv preprint arXiv:2410.21276 (2024)

  14. [15] Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1125–1134 (2017)

  15. [16] jasperai: Flux.1-dev-Controlnet-Surface-Normals. https://huggingface.co/jasperai/Flux.1-dev-Controlnet-Surface-Normals (2025). Hugging Face model, accessed 2025-01-21.

  16. [17] Ju, X., Zeng, A., Zhao, C., Wang, J., Zhang, L., Xu, Q.: HumanSD: A native skeleton-guided diffusion model for human image generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 15988–15998 (2023)

  17. [18] Karaev, N., Makarov, I., Wang, J., Neverova, N., Vedaldi, A., Rupprecht, C.: CoTracker3: Simpler and better point tracking by pseudo-labelling real videos. arXiv preprint arXiv:2410.11831 (2024)

  18. [19] Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196 (2017)

  19. [20] Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4401–4410 (2019)

  20. [21] Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing and improving the image quality of StyleGAN. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8110–8119 (2020)

  21. [22] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollar, P., Girshick, R.: Segment Anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 4015–4026 (October 2023)

  22. [23] Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al.: HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603 (2024)

  23. [24] Labs, B.F., Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dockhorn, T., English, J., English, Z., Esser, P., et al.: FLUX.1 Kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742 (2025)

  24. [25] Li, C., Liao, H., Zhi, Y., Yang, X., Sun, Z., Chang, J., Cui, S., Han, X.: MVHumanNet++: A large-scale dataset of multi-view daily dressing human captures with richer annotations for 3D human digitization. arXiv preprint arXiv:2505.01838 (2025)

  25. [26] Lin, X., Yu, F., Hu, J., You, Z., Shi, W., Ren, J.S., Gu, J., Dong, C.: Harnessing diffusion-yielded score priors for image restoration. ACM Transactions on Graphics (TOG) 44(6), 1–21 (2025)

  26. [27] Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022)

  27. [28] Liu, L., Ma, T., Li, B., Chen, Z., Liu, J., Li, G., Zhou, S., He, Q., Wu, X.: Phantom: Subject-consistent video generation via cross-modal alignment. arXiv preprint arXiv:2502.11079 (2025)

  28. [29] Liu, S., Zhao, Z., Zhi, Y., Zhao, Y., Huang, B., Wang, S., Wang, R., Xuan, M., Li, Z., Gao, S.: HeroMaker: Human-centric video editing with motion priors. In: Proceedings of the 32nd ACM International Conference on Multimedia. pp. 3761–3770 (2024)

  29. [30] Liu, X., Ren, J., Siarohin, A., Skorokhodov, I., Li, Y., Lin, D., Liu, X., Liu, Z., Tulyakov, S.: HyperHuman: Hyper-realistic human generation with latent structural diffusion. arXiv preprint arXiv:2310.08579 (2023)

  30. [31] Lu, Y., Zhang, M., Ma, A.J., Xie, X., Lai, J.: Coarse-to-fine latent diffusion for pose-guided person image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6420–6429 (2024)

  31. [32] Men, Y., Yao, Y., Cui, M., Bo, L.: MIMO: Controllable character video synthesis with spatial decomposed modeling. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 21181–21191 (2025)

  32. [33] Morelli, D., Fincato, M., Cornia, M., Landi, F., Cesari, F., Cucchiara, R.: Dress Code: High-resolution multi-category virtual try-on. In: CVPR Workshops (June 2022)

  33. [34]

  34. [35] Ning, S., Qin, Y., Han, X.: 1-2-1: Renaissance of single-network paradigm for virtual try-on. arXiv preprint arXiv:2501.05369 (2025)

  35. [36] Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4195–4205 (2023)

  36. [37] Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)

  37. [38] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10684–10695 (June 2022)

  38. [39] Sarkar, K., Liu, L., Golyanik, V., Theobalt, C.: HumanGAN: A generative model of human images. In: 2021 International Conference on 3D Vision (3DV). pp. 258–267 (2021). https://doi.org/10.1109/3DV53792.2021.00036

  39. [40] Shao, R., Pang, Y., Zheng, Z., Sun, J., Liu, Y.: Human4DiT: 360-degree human video generation with 4D diffusion transformer. arXiv preprint arXiv:2405.17405 (2024)

  40. [41] Shen, F., Jiang, X., He, X., Ye, H., Wang, C., Du, X., Li, Z., Tang, J.: IMAGDressing-v1: Customizable virtual dressing. arXiv preprint (2024)

  41. [42] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)

  42. [43] Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)

  43. [44] Tu, S., Xing, Z., Han, X., Cheng, Z.Q., Dai, Q., Luo, C., Wu, Z.: StableAnimator: High-quality identity-preserving human image animation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 21096–21106 (2025)

  44. [45] Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)

  45. [46] Wang, X., Zhang, S., Gao, C., Wang, J., Zhou, X., Zhang, Y., Yan, L., Sang, N.: UniAnimate: Taming unified video diffusion models for consistent human image animation. Science China Information Sciences (2025)

  46. [47] Wang, X., Zhang, S., Qiu, H., Chu, R., Li, Z., Zhang, Y., Gao, C., Wang, Y., Shen, C., Sang, N.: Replace anyone in videos. arXiv preprint arXiv:2409.19911 (2024)

  47. [48] Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., Yin, S.M., Bai, S., Xu, X., Chen, Y., et al.: Qwen-Image technical report. arXiv preprint arXiv:2508.02324 (2025)

  48. [49] Xiong, Z., Li, C., Liu, K., Liao, H., Hu, J., Zhu, J., Ning, S., Qiu, L., Wang, C., Wang, S., Cui, S., Han, X.: MVHumanNet: A large-scale dataset of multi-view daily dressing human captures. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19801–19811 (June 2024)

  49. [50] Xu, Y., Gu, T., Chen, W., Chen, C.: OOTDiffusion: Outfitting fusion based latent diffusion for controllable virtual try-on. arXiv preprint arXiv:2403.01779 (2024)

  50. [51] Xu, Z., Zhang, J., Liew, J.H., Yan, H., Liu, J.W., Zhang, C., Feng, J., Shou, M.Z.: MagicAnimate: Temporally consistent human image animation using diffusion model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1481–1490 (2024)

  51. [52] Yang, Q., Guan, J., Wang, K., Yu, L., Chu, W., Zhou, H., Feng, Z., Feng, H., Ding, E., Wang, J., et al.: ShowMaker: Creating high-fidelity 2D human video via fine-grained diffusion modeling. Advances in Neural Information Processing Systems 37, 51039–51062 (2024)

  52. [53] Yang, Z., Zeng, A., Yuan, C., Li, Y.: Effective whole-body pose estimation with two-stages distillation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4210–4220 (2023)

  53. [54] Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al.: CogVideoX: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072 (2024)

  54. [55] Zhai, Y., Lin, K., Li, L., Lin, C.C., Wang, J., Yang, Z., Doermann, D., Yuan, J., Liu, Z., Wang, L.: IDOL: Unified dual-modal latent diffusion for human-centric joint video-depth generation. In: European Conference on Computer Vision. pp. 134–152. Springer (2024)

  55. [56] Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3836–3847 (2023)

  56. [57] Zhang, Y., Gu, J., Wang, L.W., Wang, H., Cheng, J., Zhu, Y., Zou, F.: MimicMotion: High-quality human motion video generation with confidence-aware pose guidance. arXiv preprint arXiv:2406.19680 (2024)

  57. [58] Zhi, Y., Li, C., Liao, H., Yang, X., Sun, Z., Chang, J., Cun, X., Feng, W., Han, X.: MV-Performer: Taming video diffusion model for faithful and synchronized multi-view performer synthesis. In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers. pp. 1–14 (2025)

  58. [59] Zhong, X., Huang, X., Yang, X., Lin, G., Wu, Q.: DeCo: Decoupled human-centered diffusion video editing with motion consistency. In: European Conference on Computer Vision. pp. 352–370. Springer (2024)

  59. [60] Zhu, S., Chen, J.L., Dai, Z., Dong, Z., Xu, Y., Cao, X., Yao, Y., Zhu, H., Zhu, S.: Champ: Controllable and consistent human image animation with 3D parametric guidance. In: ECCV. Springer (2024)