pith. machine review for the scientific record.

arxiv: 2604.19720 · v1 · submitted 2026-04-21 · 💻 cs.CV

Recognition: unknown

ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 02:22 UTC · model grok-4.3

classification 💻 cs.CV
keywords: human video generation · controllable synthesis · image-first approach · SMPL-X guidance · video diffusion refinement · pose and viewpoint control · temporal consistency

The pith

High-quality, controllable human videos are generated by first creating appearance with an image model, then applying SMPL-X motion guidance and training-free temporal refinement, without any retraining on video data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that human video generation struggles when appearance, motion, and viewpoint are modeled jointly under scarce multi-view data. Instead, it decouples the problem by first training a high-quality image generator on human appearance, then guiding that output with SMPL-X pose and viewpoint controls, and finally applying a training-free refinement step drawn from a pretrained video diffusion model to enforce temporal consistency. This image-first pipeline aims to deliver videos that maintain visual fidelity while allowing diverse pose and camera control. The authors support the approach with a new canonical human dataset and an auxiliary compositional image model.
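Read as pseudocode, that decoupling is just three frozen components composed at inference time. The sketch below is a hedged illustration of the data flow, not the released implementation: `ImageBackbone`, `VideoDiffusion`, `reimagine_style_pipeline`, and the `noise_level` value are all hypothetical placeholders.

```python
import numpy as np

# Hypothetical stand-ins for the paper's three pretrained components; the
# names, signatures, and noise_level are illustrative, not the released API.

class ImageBackbone:
    """Pretrained image diffusion model carrying the appearance prior."""
    def synthesize(self, appearance_ref, smplx_pose, viewpoint):
        # Real model: denoise a latent conditioned on pose/viewpoint tokens.
        # Stub: return a fake H x W x 3 frame so the sketch runs end to end.
        return np.zeros((256, 256, 3)) + 0.0 * appearance_ref.mean()

class VideoDiffusion:
    """Off-the-shelf pretrained video diffusion model, used frozen."""
    def redenoise(self, frames, noise_level):
        # Real model: lightly noise the clip and denoise it jointly, smoothing
        # frame-to-frame inconsistencies without optimizing any parameters.
        return frames  # stub: identity

def reimagine_style_pipeline(appearance_ref, smplx_poses, viewpoints,
                             image_model, video_model, noise_level=0.2):
    # Stages 1-2: per-frame synthesis; the appearance prior is guided by
    # SMPL-X pose and camera viewpoint, with no joint video training.
    frames = [image_model.synthesize(appearance_ref, p, v)
              for p, v in zip(smplx_poses, viewpoints)]
    # Stage 3: training-free temporal refinement through the frozen video model.
    return video_model.redenoise(frames, noise_level=noise_level)

video = reimagine_style_pipeline(np.zeros((256, 256, 3)),
                                 smplx_poses=[None] * 16,
                                 viewpoints=[None] * 16,
                                 image_model=ImageBackbone(),
                                 video_model=VideoDiffusion())
```

The modularity the pith describes lives entirely in the last function: swapping `image_model` changes identity and appearance while the motion and refinement stages are untouched.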

Core claim

By treating high-quality human appearance as a prior learned through image generation and then layering SMPL-X-based motion guidance plus training-free temporal refinement from a video diffusion model, the method produces temporally consistent, high-quality videos under varied poses and viewpoints without requiring joint end-to-end training on video data.

What carries the argument

A pretrained image backbone that supplies appearance priors, combined with SMPL-X parametric body guidance for pose and viewpoint control, followed by a training-free temporal refinement stage that uses a separate pretrained video diffusion model.

If this is right

  • Videos can be produced with independent control over identity appearance, body pose sequence, and camera trajectory.
  • No video-specific fine-tuning is needed once the image backbone and refinement model are pretrained.
  • New human identities can be introduced by swapping the image-generation stage while keeping the same motion and refinement pipeline.
  • The released canonical dataset enables direct comparison of appearance priors across methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same image-first separation could extend to other domains such as animal or object video generation where multi-view video data is scarce.
  • If the refinement stage can be made conditional on additional signals, finer control over lighting or clothing dynamics might become possible without retraining.
  • Compositional image models released with the paper could allow mixing body parts or outfits at the appearance stage before motion is applied.

Load-bearing premise

High-quality appearance learned only from still images can transfer directly to video synthesis when guided by SMPL-X and refined with an off-the-shelf video model, without any joint training on video data.

What would settle it

Generate videos of the same person under extreme novel viewpoints or rapid pose transitions; if visible artifacts, identity drift, or temporal flickering appear at higher rates than in competing joint-training methods, the image-first prior fails to carry over effectively.
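One hedged way to operationalize that test, under assumptions not made in the paper: measure identity drift as the mean cosine distance between each frame's identity embedding and the first frame's (the `embed_identity` argument below is a hypothetical stand-in for a face or person re-ID encoder), and measure flicker as the mean absolute difference between consecutive frames (a crude proxy that also penalizes genuine motion; a flow-compensated variant would be stricter).

```python
import numpy as np

def identity_drift(frames, embed_identity):
    """Mean cosine distance of each frame's identity embedding from frame 0.
    `embed_identity` is a hypothetical per-frame encoder (face or re-ID)."""
    ref = embed_identity(frames[0])
    ref = ref / np.linalg.norm(ref)
    drifts = []
    for f in frames[1:]:
        e = embed_identity(f)
        drifts.append(1.0 - float(e @ ref) / np.linalg.norm(e))
    return float(np.mean(drifts))

def flicker(frames):
    """Crude temporal-flicker proxy: mean absolute difference between
    consecutive frames (ignores genuine motion)."""
    diffs = [np.abs(a.astype(np.float64) - b.astype(np.float64)).mean()
             for a, b in zip(frames[:-1], frames[1:])]
    return float(np.mean(diffs))
```

Scoring both the image-first pipeline and jointly trained baselines on the same SMPL-X-driven clips would make the premise directly falsifiable.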

Figures

Figures reproduced from arXiv: 2604.19720 by Chenghong Li, Heyuan Li, Hongjie Liao, Keru Zheng, Shuguang Cui, Shuliang Ning, Xiaoguang Han, Xihe Yang, Yihao Zhi, Zhengwentai Sun.

Figure 1
Figure 1. Our method enables controllable human synthesis at multiple levels. (a) Our pipeline generates temporally coherent videos with explicit control over body pose and camera viewpoint. (b) Our image model generalizes to in-the-wild references, producing diverse poses and viewpoints with consistent appearance. (c) As an additional contribution, our end-to-end model supports compositional synthesis with disentangled…

Figure 2
Figure 2. Overview of our image-first training and inference paradigm. (a) During training, a powerful pretrained image backbone is fine-tuned via lightweight LoRA adaptation using an imperfect multi-view dataset with structured pose and viewpoint supervision. (b) At inference, the fine-tuned model generalizes to high-quality canonical inputs and enables pose- and viewpoint-controllable human synthesis.

Figure 3
Figure 3. Overview of the proposed pose- and view-guided generation module. SMPL-X–based pose and canonical front/back appearance cues are unified in a token sequence and processed by a DiT backbone with condition-aware RoPE. The final image is obtained by decoding the generated latent using a VAE decoder (not shown for clarity).

Figure 4
Figure 4. Training-free temporal consistency via low-noise re-denoising and spatiotemporal spectral regularization. Boxes highlight regions with improved temporal consistency. The adjacent method text applies RoPE to queries and keys (the same transformation applied to \mathbf{k}_j) and computes attention as
\mathrm{Attn}(\mathbf{q}_i, \mathbf{k}_j) = \frac{\mathrm{RoPE}(\mathbf{q}_i; \mathbf{p}_i)^\top \mathrm{RoPE}(\mathbf{k}_j; \mathbf{p}_j)}{\sqrt{d}}. \quad (7)
A minimal code sketch of this attention follows the figure list.

Figure 5
Figure 5. Qualitative comparison for image-to-video human synthesis on the MVHumanNet++ dataset [25]. We compare our method with Wan-Fun [1], Wan-Animate (Wan-Ani) [3], Qwen [47], and Human4DiT [39]. The ground truth (GT) is shown in the first column.

Figure 6
Figure 6. Qualitative comparison on the DNA-Rendering dataset [6]. Our method is evaluated in a zero-shot setting without training on this dataset, demonstrating strong generalization under more challenging viewpoints. The adjacent text notes that the pipeline runs in a training-free manner with no parameters optimized, using 20 diffusion inference steps for both the image synthesis module and the training-free temporal consistency module.

Figure 7
Figure 7. Temporal consistency ablation via tracking visualization. The adjacent text reports substantially better temporal consistency for our method, with an FVD of 0.275 on MVHumanNet versus 0.403 for Wan-Animate; although Qwen-Image-Edit attains a relatively high SSIM (0.831), that metric is misleading for video generation because it rewards high-quality individual frames that lack temporal coherence.

Figure 8
Figure 8. Ablation on missing back-view appearance input. The two leftmost columns show the input appearance images, where the back-view input is replaced by a blank image in row (a).

Figure 9
Figure 9. Qualitative comparison between our image-first method and a video-first baseline (Uni-Animate DiT). The leftmost column shows the canonical front reference input (back reference omitted for space). Each row corresponds to the same subject under different poses. Red dashed boxes highlight results generated by our image-first pipeline, while the remaining images are produced by the video-first baseline.

Figure 10
Figure 10. Data construction pipeline for building canonical and disentangled human assets from MVHumanNet [48].

Figure 11
Figure 11. More results produced by our end-to-end method. The adjacent discussion argues that the refinement stage only needs to enforce motion coherence across independently generated frames, rather than learning full video synthesis from limited video data, supporting the formulation of controllable human video generation as a motion-guided image synthesis problem.

Figure 12
Figure 12. Qualitative results on in-the-wild appearance inputs. Given a single reference image and SMPL-X pose sequences, our model generates consistent human images under diverse poses and viewpoints while preserving identity and clothing details.

Figure 13
Figure 13. Qualitative results under large viewpoint changes. Starting from a canonical appearance input, the model synthesizes consistent human images as the camera viewpoint rotates around the subject, demonstrating strong control over viewpoint and pose.

Figure 14
Figure 14. Examples of additional applications enabled by our canonical asset-based training. The model can synthesize humans under diverse poses, combine identity and clothing assets for creative generation, and control pose under arbitrary viewpoints. The adjacent text notes the model robustly handles in-the-wild appearance inputs despite the variability of real-world imagery.
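The condition-aware RoPE attention quoted in the Figure 4 card (Eq. 7) amounts to rotating queries and keys by position-dependent angles before a scaled dot product. Below is a minimal single-pair sketch assuming the common split-half RoPE layout; the paper's actual token layout and its position encoding over pose and viewpoint conditions may differ, and `rope` and `attn_logit` are illustrative names, not the authors' code.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate an even-dimensional vector x by angles pos * base^(-k/half),
    pairing dimension k with dimension k + half (one common RoPE layout)."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    cos, sin = np.cos(pos * freqs), np.sin(pos * freqs)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def attn_logit(q, k, pos_q, pos_k):
    """Eq. (7): RoPE(q_i; p_i)^T RoPE(k_j; p_j) / sqrt(d)."""
    d = q.shape[-1]
    return float(rope(q, pos_q) @ rope(k, pos_k)) / np.sqrt(d)
```

Because the same rotation is applied to queries and keys, the logit depends on the two positions only through their offset, which is the relative-position property RoPE is used for.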
read the original abstract

Human video generation remains challenging due to the difficulty of jointly modeling human appearance, motion, and camera viewpoint under limited multi-view data. Existing methods often address these factors separately, resulting in limited controllability or reduced visual quality. We revisit this problem from an image-first perspective, where high-quality human appearance is learned via image generation and used as a prior for video synthesis, decoupling appearance modeling from temporal consistency. We propose a pose- and viewpoint-controllable pipeline that combines a pretrained image backbone with SMPL-X-based motion guidance, together with a training-free temporal refinement stage based on a pretrained video diffusion model. Our method produces high-quality, temporally consistent videos under diverse poses and viewpoints. We also release a canonical human dataset and an auxiliary model for compositional human image synthesis. Code and data are publicly available at https://github.com/Taited/ReImagine.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to introduce ReImagine, an image-first pipeline for pose- and viewpoint-controllable high-quality human video generation. High-quality appearance is learned via a pretrained image backbone and used as a prior; motion and viewpoint are controlled via SMPL-X guidance; temporal consistency is achieved through a training-free refinement stage that employs a pretrained video diffusion model. The authors also release a canonical human dataset and an auxiliary model for compositional human image synthesis, with code and data made publicly available.

Significance. If the central claims hold, the work would be significant for its modular decoupling of appearance modeling (via image priors) from motion and temporal consistency, potentially reducing reliance on scarce multi-view video data. The public release of the dataset, auxiliary model, and code is a clear strength that supports reproducibility and downstream research in controllable human video synthesis.

major comments (2)
  1. [§3] §3 (Method), temporal refinement stage: the claim that a training-free video diffusion model can reliably resolve frame-to-frame appearance/lighting mismatches induced by SMPL-X pose and viewpoint changes (without joint training or fine-tuning) is load-bearing for the temporal-consistency guarantee, yet the manuscript provides no ablations or failure-case analysis on distribution shift between the image-backbone outputs and the diffusion model's training distribution.
  2. [§4] §4 (Experiments): quantitative support for the claim of 'high-quality, temporally consistent videos under diverse poses and viewpoints' is not fully detailed in the provided description; without reported metrics (e.g., FVD, temporal consistency scores) and direct comparisons against jointly-trained baselines on the same SMPL-X-driven test cases, the superiority of the image-first + training-free approach cannot be assessed.
minor comments (2)
  1. The abstract and method overview would benefit from a concise diagram or pseudocode summarizing the three-stage pipeline (image backbone → SMPL-X guidance → training-free refinement) to clarify data flow and conditioning.
  2. Notation for SMPL-X parameters (pose, shape, viewpoint) should be explicitly defined in the first use within the method section to avoid ambiguity for readers unfamiliar with the exact parameterization.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of our work's significance, the recognition of our public releases, and the constructive feedback. We address each major comment below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§3] §3 (Method), temporal refinement stage: the claim that a training-free video diffusion model can reliably resolve frame-to-frame appearance/lighting mismatches induced by SMPL-X pose and viewpoint changes (without joint training or fine-tuning) is load-bearing for the temporal-consistency guarantee, yet the manuscript provides no ablations or failure-case analysis on distribution shift between the image-backbone outputs and the diffusion model's training distribution.

    Authors: We agree that additional ablations and failure-case analysis for the training-free temporal refinement stage would better substantiate the claims. In the revised manuscript, we will add experiments ablating the refinement stage (with/without it) across diverse SMPL-X pose and viewpoint changes, including quantitative measures of mismatch resolution and qualitative discussion of remaining failure cases due to distribution shifts between the image backbone outputs and the video diffusion model's training data. revision: yes

  2. Referee: [§4] §4 (Experiments): quantitative support for the claim of 'high-quality, temporally consistent videos under diverse poses and viewpoints' is not fully detailed in the provided description; without reported metrics (e.g., FVD, temporal consistency scores) and direct comparisons against jointly-trained baselines on the same SMPL-X-driven test cases, the superiority of the image-first + training-free approach cannot be assessed.

    Authors: We acknowledge that expanded quantitative evaluation would allow better assessment of the approach. The revised experiments section will report standard metrics such as Fréchet Video Distance (FVD) and temporal consistency scores. We will also add direct comparisons to relevant jointly-trained baselines on the same SMPL-X-driven test cases, using our publicly released canonical human dataset and code to support reproducibility. revision: yes
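For context on the promised metric, FVD fits Gaussians to video-level features, conventionally extracted with a pretrained I3D action-recognition network, and reports the Fréchet distance between the two fits. A minimal sketch of that distance, assuming feature extraction happens upstream:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussian fits of two feature sets
    (rows = videos, cols = feature dims). For FVD, features typically come
    from a pretrained I3D network; extraction is assumed done upstream."""
    mu_r, mu_g = feats_real.mean(0), feats_gen.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```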

Circularity Check

0 steps flagged

No significant circularity; pipeline assembles independent pretrained components

full rationale

The paper presents a pipeline that decouples appearance modeling (via a pretrained image backbone) from motion (via SMPL-X guidance) and applies a separate training-free refinement stage using an off-the-shelf pretrained video diffusion model. No equations, fitted parameters, or self-citations are shown that would reduce any claimed prediction or result to the inputs by construction. The central claims rest on the empirical combination of existing models rather than on any self-definitional loop, uniqueness theorem imported from the authors' prior work, or renaming of known results. This is the standard case of an engineering synthesis paper whose derivation chain remains externally grounded.

Axiom & Free-Parameter Ledger

0 free parameters · 3 axioms · 0 invented entities

The central claim rests on assumptions about the compatibility and effectiveness of pretrained diffusion models and the SMPL-X body model for decoupling appearance and motion; no new entities are invented.

axioms (3)
  • domain assumption Pretrained image diffusion models provide high-quality human appearance priors suitable for video synthesis.
    Core to the image-first decoupling strategy described in the abstract.
  • domain assumption SMPL-X-based guidance is sufficient to control diverse poses and viewpoints in the generated video.
    Invoked for the controllable pipeline stage.
  • domain assumption A training-free refinement stage using a pretrained video diffusion model can enforce temporal consistency.
    Key premise for the final stage without additional training.

pith-pipeline@v0.9.0 · 5478 in / 1538 out tokens · 53010 ms · 2026-05-10T02:22:15.304373+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

59 extracted references · 24 canonical work pages · 12 internal anchors

  1. [1] Alibaba-PAI: Wan2.1-Fun-V1.1-14B-Control. https://huggingface.co/alibaba-pai/Wan2.1-Fun-V1.1-14B-Control (2025). Hugging Face model, accessed 2025-01-21. Apache-2.0 license.

  2. [2] Chen, L., Ma, T., Liu, J., Li, B., Chen, Z., Liu, L., He, X., Li, G., He, Q., Wu, Z.: HuMo: Human-centric video generation via collaborative multi-modal conditioning. arXiv preprint arXiv:2509.08519 (2025)

  3. [4] Cheng, G., Gao, X., Hu, L., Hu, S., Huang, M., Ji, C., Li, J., Meng, D., Qi, J., Qiao, P., et al.: Wan-Animate: Unified character animation and replacement with holistic replication. arXiv preprint arXiv:2509.14055 (2025)

  4. [5] Cheng, T., Song, L., Ge, Y., Liu, W., Wang, X., Shan, Y.: YOLO-World: Real-time open-vocabulary object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16901–16911 (2024)

  5. [6] Cheng, W., Chen, R., Fan, S., Yin, W., Chen, K., Cai, Z., Wang, J., Gao, Y., Yu, Z., Lin, Z., Ren, D., Yang, L., Liu, Z., Loy, C.C., Qian, C., Wu, W., Lin, D., Dai, B., Lin, K.Y.: DNA-Rendering: A diverse neural actor repository for high-fidelity human-centric rendering. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2023)

  6. [7] Choi, Y., Kwak, S., Lee, K., Choi, H., Shin, J.: Improving diffusion models for authentic virtual try-on in the wild. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. pp. 206–235. Springer Nature Switzerland, Cham (2025)

  7. [8] Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first International Conference on Machine Learning (2024)

  8. [9] Fu, J., Li, S., Jiang, Y., Lin, K.Y., Qian, C., Loy, C.C., Wu, W., Liu, Z.: StyleGAN-Human: A data-centric odyssey of human generation. In: European Conference on Computer Vision. pp. 1–19. Springer (2022)

  9. [10] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020)

  10. [11] Hong, W., Ding, M., Zheng, W., Liu, X., Tang, J.: CogVideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868 (2022)

  11. [12] Hu, L.: Animate Anyone: Consistent and controllable image-to-video synthesis for character animation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8153–8163 (2024)

  12. [13] Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., Wang, Y., Chen, X., Wang, L., Lin, D., Qiao, Y., Liu, Z.: VBench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)

  13. [14] Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al.: GPT-4o system card. arXiv preprint arXiv:2410.21276 (2024)

  14. [15] Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1125–1134 (2017)

  15. [16] jasperai: Flux.1-dev-Controlnet-Surface-Normals. https://huggingface.co/jasperai/Flux.1-dev-Controlnet-Surface-Normals (2025). Hugging Face model, accessed 2025-01-21.

  16. [17] Ju, X., Zeng, A., Zhao, C., Wang, J., Zhang, L., Xu, Q.: HumanSD: A native skeleton-guided diffusion model for human image generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 15988–15998 (2023)

  17. [18] Karaev, N., Makarov, I., Wang, J., Neverova, N., Vedaldi, A., Rupprecht, C.: CoTracker3: Simpler and better point tracking by pseudo-labelling real videos. arXiv preprint arXiv:2410.11831 (2024)

  18. [19] Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196 (2017)

  19. [20] Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4401–4410 (2019)

  20. [21] Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing and improving the image quality of StyleGAN. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8110–8119 (2020)

  21. [22] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollar, P., Girshick, R.: Segment Anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 4015–4026 (October 2023)

  22. [23] Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al.: HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603 (2024)

  23. [24] Labs, B.F., Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dockhorn, T., English, J., English, Z., Esser, P., et al.: FLUX.1 Kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742 (2025)

  24. [25] Li, C., Liao, H., Zhi, Y., Yang, X., Sun, Z., Chang, J., Cui, S., Han, X.: MVHumanNet++: A large-scale dataset of multi-view daily dressing human captures with richer annotations for 3D human digitization. arXiv preprint arXiv:2505.01838 (2025)

  25. [26] Lin, X., Yu, F., Hu, J., You, Z., Shi, W., Ren, J.S., Gu, J., Dong, C.: Harnessing diffusion-yielded score priors for image restoration. ACM Transactions on Graphics (TOG) 44(6), 1–21 (2025)

  26. [27] Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022)

  27. [28] Liu, L., Ma, T., Li, B., Chen, Z., Liu, J., Li, G., Zhou, S., He, Q., Wu, X.: Phantom: Subject-consistent video generation via cross-modal alignment. arXiv preprint arXiv:2502.11079 (2025)

  28. [29] Liu, S., Zhao, Z., Zhi, Y., Zhao, Y., Huang, B., Wang, S., Wang, R., Xuan, M., Li, Z., Gao, S.: HeroMaker: Human-centric video editing with motion priors. In: Proceedings of the 32nd ACM International Conference on Multimedia. pp. 3761–3770 (2024)

  29. [30] Liu, X., Ren, J., Siarohin, A., Skorokhodov, I., Li, Y., Lin, D., Liu, X., Liu, Z., Tulyakov, S.: HyperHuman: Hyper-realistic human generation with latent structural diffusion. arXiv preprint arXiv:2310.08579 (2023)

  30. [31] Lu, Y., Zhang, M., Ma, A.J., Xie, X., Lai, J.: Coarse-to-fine latent diffusion for pose-guided person image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6420–6429 (2024)

  31. [32] Men, Y., Yao, Y., Cui, M., Bo, L.: MIMO: Controllable character video synthesis with spatial decomposed modeling. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 21181–21191 (2025)

  32. [33] Morelli, D., Fincato, M., Cornia, M., Landi, F., Cesari, F., Cucchiara, R.: Dress Code: High-resolution multi-category virtual try-on. In: CVPR Workshops (June 2022)

  33. [34]

  34. [35] Ning, S., Qin, Y., Han, X.: 1-2-1: Renaissance of single-network paradigm for virtual try-on. arXiv preprint arXiv:2501.05369 (2025)

  35. [36] Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4195–4205 (2023)

  36. [37] Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)

  37. [38] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10684–10695 (June 2022)

  38. [39] Sarkar, K., Liu, L., Golyanik, V., Theobalt, C.: HumanGAN: A generative model of human images. In: 2021 International Conference on 3D Vision (3DV). pp. 258–267 (2021). https://doi.org/10.1109/3DV53792.2021.00036

  39. [40] Shao, R., Pang, Y., Zheng, Z., Sun, J., Liu, Y.: Human4DiT: 360-degree human video generation with 4D diffusion transformer. arXiv preprint arXiv:2405.17405 (2024)

  40. [41] Shen, F., Jiang, X., He, X., Ye, H., Wang, C., Du, X., Li, Z., Tang, J.: IMAGDressing-v1: Customizable virtual dressing. arXiv preprint (2024)

  41. [42] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)

  42. [43] Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)

  43. [44] Tu, S., Xing, Z., Han, X., Cheng, Z.Q., Dai, Q., Luo, C., Wu, Z.: StableAnimator: High-quality identity-preserving human image animation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 21096–21106 (2025)

  44. [45] Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)

  45. [46] Wang, X., Zhang, S., Gao, C., Wang, J., Zhou, X., Zhang, Y., Yan, L., Sang, N.: UniAnimate: Taming unified video diffusion models for consistent human image animation. Science China Information Sciences (2025)

  46. [47] Wang, X., Zhang, S., Qiu, H., Chu, R., Li, Z., Zhang, Y., Gao, C., Wang, Y., Shen, C., Sang, N.: Replace anyone in videos. arXiv preprint arXiv:2409.19911 (2024)

  47. [48] Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., Yin, S.M., Bai, S., Xu, X., Chen, Y., et al.: Qwen-Image technical report. arXiv preprint arXiv:2508.02324 (2025)

  48. [49] Xiong, Z., Li, C., Liu, K., Liao, H., Hu, J., Zhu, J., Ning, S., Qiu, L., Wang, C., Wang, S., Cui, S., Han, X.: MVHumanNet: A large-scale dataset of multi-view daily dressing human captures. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19801–19811 (June 2024)

  49. [50] Xu, Y., Gu, T., Chen, W., Chen, C.: OOTDiffusion: Outfitting fusion based latent diffusion for controllable virtual try-on. arXiv preprint arXiv:2403.01779 (2024)

  50. [51] Xu, Z., Zhang, J., Liew, J.H., Yan, H., Liu, J.W., Zhang, C., Feng, J., Shou, M.Z.: MagicAnimate: Temporally consistent human image animation using diffusion model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1481–1490 (2024)

  51. [52] Yang, Q., Guan, J., Wang, K., Yu, L., Chu, W., Zhou, H., Feng, Z., Feng, H., Ding, E., Wang, J., et al.: ShowMaker: Creating high-fidelity 2D human video via fine-grained diffusion modeling. Advances in Neural Information Processing Systems 37, 51039–51062 (2024)

  52. [53] Yang, Z., Zeng, A., Yuan, C., Li, Y.: Effective whole-body pose estimation with two-stages distillation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4210–4220 (2023)

  53. [54] Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al.: CogVideoX: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072 (2024)

  54. [55] Zhai, Y., Lin, K., Li, L., Lin, C.C., Wang, J., Yang, Z., Doermann, D., Yuan, J., Liu, Z., Wang, L.: IDOL: Unified dual-modal latent diffusion for human-centric joint video-depth generation. In: European Conference on Computer Vision. pp. 134–152. Springer (2024)

  55. [56] Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3836–3847 (2023)

  56. [57] Zhang, Y., Gu, J., Wang, L.W., Wang, H., Cheng, J., Zhu, Y., Zou, F.: MimicMotion: High-quality human motion video generation with confidence-aware pose guidance. arXiv preprint arXiv:2406.19680 (2024)

  57. [58] Zhi, Y., Li, C., Liao, H., Yang, X., Sun, Z., Chang, J., Cun, X., Feng, W., Han, X.: MV-Performer: Taming video diffusion model for faithful and synchronized multi-view performer synthesis. In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers. pp. 1–14 (2025)

  58. [59] Zhong, X., Huang, X., Yang, X., Lin, G., Wu, Q.: DeCo: Decoupled human-centered diffusion video editing with motion consistency. In: European Conference on Computer Vision. pp. 352–370. Springer (2024)

  59. [60] Zhu, S., Chen, J.L., Dai, Z., Dong, Z., Xu, Y., Cao, X., Yao, Y., Zhu, H., Zhu, S.: Champ: Controllable and consistent human image animation with 3D parametric guidance. In: ECCV. Springer (2024)