SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation

Gangshan Wu; Haolan Chen; Jiaming Zhang; Kai Ma; Limin Wang; Rui Li; Shengming Cao; Xiaotong Zhao; Xinglin Hou; Yutao Cui

arxiv: 2511.19320 · v2 · pith:BGDZMYBEnew · submitted 2025-11-24 · 💻 cs.CV


Jiaming Zhang, Shengming Cao, Rui Li, Xiaotong Zhao, Yutao Cui, Xinglin Hou, Gangshan Wu, Haolan Chen, Yu Xu, Limin Wang, Kai Ma

Pith reviewed 2026-05-21 18:16 UTC · model grok-4.3


keywords: human image animation · first-frame preservation · image-to-video · condition reconciliation · pose modulation · decoupled training · motion control


The pith

SteadyDancer is the first framework to robustly preserve first-frame identity in human image animation through a reconciled Image-to-Video approach.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that moving to an Image-to-Video paradigm and explicitly reconciling the reference and motion conditions allows robust preservation of the first frame's identity during animation. It introduces three main components: a Condition-Reconciliation Mechanism, Synergistic Pose Modulation Modules, and a Staged Decoupled-Objective Training Pipeline. A reader would care because this addresses common failures, such as identity drift, when generating video from a single image plus motion data. If it works, it enables more reliable and efficient creation of coherent human animations.

Core claim

SteadyDancer is an Image-to-Video (I2V) paradigm-based framework that achieves harmonized and coherent animation and is the first to ensure first-frame preservation robustly. Firstly, it proposes a Condition-Reconciliation Mechanism to harmonize the two conflicting conditions, enabling precise control without sacrificing fidelity. Secondly, it designs Synergistic Pose Modulation Modules to generate an adaptive and coherent pose representation that is highly compatible with the reference image. Finally, it employs a Staged Decoupled-Objective Training Pipeline that hierarchically optimizes the model for motion fidelity, visual quality, and temporal coherence.

What carries the argument

Condition-Reconciliation Mechanism that harmonizes conflicting reference and motion conditions in the I2V framework
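The paper passages quoted further down this page mention element-wise addition and channel-wise concatenation as ways of combining conditions. The sketch below only illustrates the difference between those two operations on toy data. The `Feature` type, the function names, and the values are hypothetical, and nothing here reflects the paper's actual architecture or which operation it applies to which condition.

```python
from typing import List

# Toy stand-in for a latent tensor: [channels][positions]. Hypothetical layout.
Feature = List[List[float]]


def fuse_by_addition(ref: Feature, pose: Feature) -> Feature:
    """Element-wise addition: shapes must match and the channel count is unchanged."""
    if len(ref) != len(pose) or any(len(r) != len(p) for r, p in zip(ref, pose)):
        raise ValueError("addition needs identical shapes")
    return [[r + p for r, p in zip(rc, pc)] for rc, pc in zip(ref, pose)]


def fuse_by_concat(ref: Feature, pose: Feature) -> Feature:
    """Channel-wise concatenation: positions must match and the channels are stacked."""
    if any(len(c) != len(ref[0]) for c in ref + pose):
        raise ValueError("concatenation needs the same number of positions per channel")
    return [list(c) for c in ref] + [list(c) for c in pose]


# Made-up values, two channels with two positions each.
ref_feat = [[1.0, 2.0], [3.0, 4.0]]
pose_feat = [[0.5, 0.5], [-1.0, 1.0]]

added = fuse_by_addition(ref_feat, pose_feat)   # still 2 channels, signals are mixed
stacked = fuse_by_concat(ref_feat, pose_feat)   # 4 channels, signals stay separable
```

Addition mixes the two signals into one tensor of the same width, while concatenation keeps them in separate channels and leaves it to later layers to combine them.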

If this is right

Achieves state-of-the-art performance in both appearance fidelity and motion control
Requires significantly fewer training resources than comparable methods
Produces harmonized and coherent results without identity drift or visual artifacts
Enables robust first-frame preservation for the first time in this setting

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar reconciliation techniques could be adapted for other conditional generation tasks like text-guided video editing.
The reduced training resources might lower the barrier for creating custom animation models in smaller research groups.
Extending the staged training to include more objectives could further improve long-term temporal coherence in generated videos.

Load-bearing premise

The Condition-Reconciliation Mechanism can harmonize the conflicting reference and motion conditions without introducing new visual artifacts or losing motion precision.

What would settle it

The claim would be falsified if videos generated on challenging inputs, with large spatio-temporal misalignments between the reference image and the motion sequence, still exhibited identity drift or visual artifacts.
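One way to make "identity drift" measurable is to embed every generated frame with an identity encoder and track its similarity to the first frame. The sketch below assumes such embeddings already exist. The function names, the 0.8 threshold, and the toy vectors are all hypothetical and do not come from the paper.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("zero-length embedding")
    return dot / (norm_a * norm_b)


def identity_drift(
    frame_embeddings: List[Sequence[float]], threshold: float = 0.8
) -> Tuple[List[float], Optional[int]]:
    """Return each frame's similarity to frame 0 and the first frame index
    whose similarity falls below the (assumed) threshold, or None."""
    first = frame_embeddings[0]
    sims = [cosine(first, e) for e in frame_embeddings]
    first_drift = next((i for i, s in enumerate(sims) if s < threshold), None)
    return sims, first_drift


# Made-up 2-D embeddings that rotate away from the first frame.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.6, 0.8], [0.0, 1.0]]
sims, drift_at = identity_drift(embeddings, threshold=0.8)
```

A curve of `sims` that stays flat on misaligned inputs would support the claim, and an early drop would count against it.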

Figures

Figures reproduced from arXiv: 2511.19320 by Gangshan Wu, Haolan Chen, Jiaming Zhang, Kai Ma, Limin Wang, Rui Li, Shengming Cao, Xiaotong Zhao, Xinglin Hou, Yutao Cui, Yu Xu.

**Figure 1.** We introduce SteadyDancer, an Image-to-Video (I2V) paradigm-based human animation framework, with Motion-to-Image...

**Figure 2.** (a) The spatio-temporal misalignment in practical sce...

**Figure 3.** An overview of SteadyDancer, a Human Image Animation framework based on the Image-to-Video (I2V) paradigm. First, it em...

**Figure 4.** Pose Simulation in Motion Discontinuity Mitigation of...

**Figure 5.** Model performance across various training steps. The...

**Figure 6.** Ablation of Condition-Reconciliation Mechanism.

**Figure 7.** Qualitative comparisons between SteadyDancer and other methods on the X-Dance benchmark.

**Figure 8.** Qualitative comparisons between SteadyDancer and other methods on the X-Dance benchmark. Each example displays the...

**Figure 9.** Visualization on RealisDance-Val. Even when driven solely by human pose signals, our model successfully synthesizes the...

**Figure 10.** Comparison on RealisDance-Val. Compared to other models, our model not only achieves precise control but also...

**Figure 12.** Ablation study of Condition-Decoupled Distillation.

**Figure 13.** Ablation study of Motion Discontinuity Mitigation.

**Figure 14.** Examples from the X-Dance benchmark. The second and third rows display driving video sequences, comprising both intricate,...

**Figure 15.** Performance comparison of four Motion Discontinuity Mitigation methods, showing that the Pose Simulation approach gener...

**Figure 16.** Model performance using Decoupled-Condition Classifier-Free Guidance (DC-CFG). From top to bottom, the rows display the...
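The Figure 16 caption names Decoupled-Condition Classifier-Free Guidance (DC-CFG), but this page does not give its formula. As background only, the sketch below shows the generic nested two-condition form of classifier-free guidance used in earlier work such as InstructPix2Pix, where each condition gets its own guidance scale. The function name, the argument names, the order in which the conditions are nested, and the numbers are assumptions, and the paper's DC-CFG may differ.

```python
from typing import List


def two_condition_cfg(
    eps_uncond: List[float],    # prediction with both conditions dropped
    eps_ref: List[float],       # prediction with only the reference condition
    eps_ref_pose: List[float],  # prediction with reference and pose conditions
    scale_ref: float,
    scale_pose: float,
) -> List[float]:
    """Generic nested two-condition guidance (assumed form, not the paper's):
    eps = eps_uncond + s_ref * (eps_ref - eps_uncond)
                     + s_pose * (eps_ref_pose - eps_ref)
    """
    return [
        u + scale_ref * (r - u) + scale_pose * (rp - r)
        for u, r, rp in zip(eps_uncond, eps_ref, eps_ref_pose)
    ]


# Made-up two-element "noise predictions".
guided = two_condition_cfg([0.0, 1.0], [1.0, 1.0], [1.0, 3.0], scale_ref=2.0, scale_pose=3.0)

# With both scales at 1.0 the result collapses to the fully conditioned prediction.
plain = two_condition_cfg([0.0, 1.0], [1.0, 1.0], [1.0, 3.0], scale_ref=1.0, scale_pose=1.0)
```

Separate scales let the strength of appearance guidance and motion guidance be tuned independently, which is the general motivation for decoupling the conditions.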

Original abstract

Preserving first-frame identity while ensuring precise motion control is a fundamental challenge in human image animation. The Image-to-Motion Binding process of the dominant Reference-to-Video (R2V) paradigm overlooks critical spatio-temporal misalignments common in real-world applications, leading to failures such as identity drift and visual artifacts. We introduce SteadyDancer, an Image-to-Video (I2V) paradigm-based framework that achieves harmonized and coherent animation and is the first to ensure first-frame preservation robustly. Firstly, we propose a Condition-Reconciliation Mechanism to harmonize the two conflicting conditions, enabling precise control without sacrificing fidelity. Secondly, we design Synergistic Pose Modulation Modules to generate an adaptive and coherent pose representation that is highly compatible with the reference image. Finally, we employ a Staged Decoupled-Objective Training Pipeline that hierarchically optimizes the model for motion fidelity, visual quality, and temporal coherence. Experiments demonstrate that SteadyDancer achieves state-of-the-art performance in both appearance fidelity and motion control, while requiring significantly fewer training resources than comparable methods. The model has been publicly released at https://mcg-nju.github.io/steadydancer-web.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SteadyDancer shifts to an I2V setup with a Condition-Reconciliation Mechanism and pose modules to tackle first-frame identity drift, but the abstract's SOTA claims rest on experiments that need closer checking for large misalignments.


SteadyDancer moves human image animation to an I2V setup to better preserve the first frame while adding motion control. The new pieces are the Condition-Reconciliation Mechanism for handling conflicting conditions, the Synergistic Pose Modulation Modules for compatible poses, and a staged training pipeline. This combination is presented as the first robust fix for identity drift in the dominant R2V approach.

Referee Report

2 major / 2 minor

Summary. The paper introduces SteadyDancer, an Image-to-Video (I2V) paradigm framework for human image animation that addresses first-frame identity preservation and motion control. It proposes three main components: a Condition-Reconciliation Mechanism to harmonize conflicting reference and motion conditions, Synergistic Pose Modulation Modules for adaptive pose representations compatible with the reference image, and a Staged Decoupled-Objective Training Pipeline for hierarchical optimization of motion fidelity, visual quality, and temporal coherence. The work claims state-of-the-art performance in appearance fidelity and motion control with significantly fewer training resources than prior Reference-to-Video (R2V) methods, and releases the model publicly.

Significance. If the central claims hold, the shift to an I2V paradigm with explicit reconciliation of conditions could meaningfully improve robustness in human animation tasks, particularly for real-world scenarios with spatio-temporal misalignments. The public model release supports reproducibility and further research in the field.

major comments (2)

[§3.2] (Condition-Reconciliation Mechanism): The description of how the mechanism resolves reference-motion conflicts (e.g., via feature fusion or attention) does not include an explicit ablation or analysis isolating its performance under large initial pose/position offsets; this is load-bearing for the robustness claim over R2V baselines and the assertion of no new artifacts or identity drift.
[Experiments section] Table 1 and Figure 4: Quantitative comparisons to baselines are presented, but the manuscript does not report error bars, statistical significance tests, or detailed dataset statistics (e.g., number of sequences with large misalignments), making it difficult to verify the SOTA claims in fidelity and control.
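The dataset statistic the referee asks for (how many test sequences have large misalignments) could be computed roughly as follows. This is a sketch under stated assumptions: the offset measure (mean keypoint distance normalised by the reference pose's bounding-box diagonal), the 0.25 threshold, and all function names are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple

Keypoints = Sequence[Tuple[float, float]]


def pose_offset(ref_kpts: Keypoints, drive_kpts: Keypoints) -> float:
    """Mean keypoint distance between the reference pose and the first driving pose,
    normalised by the diagonal of the reference pose's bounding box."""
    if not ref_kpts or len(ref_kpts) != len(drive_kpts):
        raise ValueError("need two non-empty keypoint sets of equal length")
    xs = [x for x, _ in ref_kpts]
    ys = [y for _, y in ref_kpts]
    diag = math.hypot(max(xs) - min(xs), max(ys) - min(ys))
    if diag == 0:
        raise ValueError("degenerate reference pose")
    mean_dist = sum(math.dist(a, b) for a, b in zip(ref_kpts, drive_kpts)) / len(ref_kpts)
    return mean_dist / diag


def count_large_misalignments(
    pairs: List[Tuple[Keypoints, Keypoints]], threshold: float = 0.25
) -> Tuple[int, float]:
    """Count pairs whose offset exceeds the (assumed) threshold and return
    the count together with its fraction of the test set."""
    offsets = [pose_offset(ref, drive) for ref, drive in pairs]
    n_large = sum(o > threshold for o in offsets)
    return n_large, n_large / len(offsets)


# Made-up example: one aligned pair and one pair shifted by (3, 4).
ref = [(0.0, 0.0), (3.0, 4.0)]
shifted = [(3.0, 4.0), (6.0, 8.0)]
n_large, fraction = count_large_misalignments([(ref, ref), (ref, shifted)])
```

Reporting `n_large` and `fraction` alongside the main results table would let readers judge how much of the benchmark actually stresses the robustness claim.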

minor comments (2)

[Abstract] The abstract states 'significantly fewer training resources' without specifying metrics such as GPU-hours or parameter counts relative to the compared methods.
[§3.3] Notation for the Synergistic Pose Modulation Modules could be clarified with an equation showing how the adaptive pose representation is computed from the reference image.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments highlight important aspects for strengthening the presentation of our robustness claims and experimental rigor. We address each point below and will incorporate the suggested additions in the revised manuscript.


Referee: [§3.2] (Condition-Reconciliation Mechanism): The description of how the mechanism resolves reference-motion conflicts (e.g., via feature fusion or attention) does not include an explicit ablation or analysis isolating its performance under large initial pose/position offsets; this is load-bearing for the robustness claim over R2V baselines and the assertion of no new artifacts or identity drift.

Authors: We agree that an explicit ablation isolating the Condition-Reconciliation Mechanism under large initial pose/position offsets would provide stronger evidence for our robustness claims and the absence of new artifacts or identity drift. In the revision, we will add a targeted ablation study on a subset of test sequences exhibiting significant misalignments. This will compare the full model against an ablated variant without the reconciliation module, reporting metrics on appearance fidelity and motion control to directly support the mechanism's contribution.

revision: yes

Referee: [Experiments section] Table 1 and Figure 4: Quantitative comparisons to baselines are presented, but the manuscript does not report error bars, statistical significance tests, or detailed dataset statistics (e.g., number of sequences with large misalignments), making it difficult to verify the SOTA claims in fidelity and control.

Authors: We concur that including error bars, statistical significance tests, and more granular dataset statistics would improve the verifiability of our SOTA claims. We will update the Experiments section and Table 1 to report error bars (e.g., standard deviation across multiple runs), include results of statistical significance tests (such as paired t-tests against baselines), and add details on the test set composition, specifically the count and percentage of sequences with large misalignments. These changes will be reflected in the revised manuscript.

revision: yes
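The error bars and paired t-test promised in this response can be sketched with the Python standard library alone. The scores below are invented for illustration and are not results from the paper; `paired_summary` is a hypothetical helper name.

```python
import math
import statistics
from typing import Dict, Sequence


def paired_summary(ours: Sequence[float], baseline: Sequence[float]) -> Dict[str, float]:
    """Per-method mean and standard deviation, plus the paired t statistic
    computed on per-sequence differences (ours - baseline)."""
    if len(ours) != len(baseline) or len(ours) < 2:
        raise ValueError("need two equally long samples with at least 2 items")
    diffs = [a - b for a, b in zip(ours, baseline)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t statistic is undefined")
    return {
        "ours_mean": statistics.mean(ours),
        "ours_std": statistics.stdev(ours),
        "baseline_mean": statistics.mean(baseline),
        "baseline_std": statistics.stdev(baseline),
        "mean_diff": mean_diff,
        "t_stat": mean_diff / (sd_diff / math.sqrt(n)),
        "dof": n - 1,
    }


# Invented per-sequence scores where higher is better.
ours_scores = [0.80, 0.82, 0.79, 0.85]
baseline_scores = [0.75, 0.80, 0.78, 0.80]
summary = paired_summary(ours_scores, baseline_scores)
```

The t statistic is then compared against a Student t distribution with `dof` degrees of freedom to obtain a p-value. If SciPy is available, `scipy.stats.ttest_rel` performs the same paired test and returns the p-value directly.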

Circularity Check

0 steps flagged

No significant circularity; new mechanisms presented as independent contributions


The paper proposes a new I2V paradigm framework with three explicitly introduced components: Condition-Reconciliation Mechanism, Synergistic Pose Modulation Modules, and Staged Decoupled-Objective Training Pipeline. These are described as novel designs to address spatio-temporal misalignments and first-frame preservation, without any equations, fitted parameters, or self-citations that reduce the claimed results to inputs by construction. The abstract and overview frame the improvements as arising from these architectural additions and a staged training pipeline, supported by experiments, rather than re-deriving prior quantities. No load-bearing step equates a prediction or uniqueness claim to a self-defined or previously fitted element.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are described in the abstract; the framework relies on standard deep-learning training assumptions and the I2V paradigm.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: "We propose a Condition-Reconciliation Mechanism to harmonize the two conflicting conditions... Synergistic Pose Modulation Modules... Staged Decoupled-Objective Training Pipeline"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "element-wise addition... channel-wise concatenation... LoRA-based fine-tuning"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[46]

This idealized setup fails to reflect the spatio- temporal misalignment challenges prevalent in real-world applications

X-Dance Standard benchmarks, such as TikTok and RealisDance, source both the reference image and pose sequence from the same video. This idealized setup fails to reflect the spatio- temporal misalignment challenges prevalent in real-world applications. As shown in Fig. 14, to more robustly eval- uate the model’s generalization capabilities in such scenar-...

work page
[47]

Motion Discontinuity Mitigation

Model Details 7.1. Motion Discontinuity Mitigation. As discussed in the main text, to address the abrupt transi- tion between the reference frame and the initial pose frame, we proposePose Simulationto explicitly replicate this dis- continuity within the training data. Specifically, given a smooth training sequence{p 0, p1, . . . , pT }, we first con- str...

work page
[48]

1) Domain Gap in Stylized Im- ages

Limitation and Future Work Despite the promising results achieved by SteadyDancer in harmonized and coherent animation, several limitations remain to be addressed. 1) Domain Gap in Stylized Im- ages. While our model delivers visually pleasing and co- herent results for anime reference frames, its performance remains slightly inferior to the exceptional fi...

work page

[1] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets. CoRR, abs/2311.15127, 2023.

[2] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024.

[3] Di Chang, Hongyi Xu, You Xie, Yipeng Gao, Zhengfei Kuang, Shengqu Cai, Chenxu Zhang, Guoxian Song, Chao Wang, Yichun Shi, Zeyuan Chen, Shijie Zhou, Linjie Luo, Gordon Wetzstein, and Mohammad Soleymani. X-dyna: Expressive dynamic human image animation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 1...

[4] Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter1: Open diffusion models for high-quality video generation. CoRR, abs/2310.19512, 2023.

[5] Gang Cheng, Xin Gao, Li Hu, Siqi Hu, Mingyang Huang, Chaonan Ji, Ju Li, Dechao Meng, Jinwei Qi, Penchong Qiao, Zhen Shen, Yafei Song, Ke Sun, Linrui Tian, Feng Wang, Guangyuan Wang, Qi Wang, Zhongjian Wang, Jiayu Xiao, Sheng Xu, Bang Zhang, Peng Zhang, Xindi Zhang, Zhe Zhang, Jingren Zhou, and Lian Zhuo. Wan-animate: Unified character animation and replacement with holistic replication. arXiv preprint arXiv:2509.14055, 2025.

[6] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.

[7] Alain Hore and Djemel Ziou. Image quality metrics: Psnr vs. ssim. In 2010 20th international conference on pattern recognition, pages 2366–2369. IEEE, 2010.

[8] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.

[9] Li Hu. Animate anyone: Consistent and controllable image-to-video synthesis for character animation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 8153–8163. IEEE, 2024.

[10] Li Hu, Guangyuan Wang, Zhen Shen, Xin Gao, Dechao Meng, Lian Zhuo, Peng Zhang, Bang Zhang, and Liefeng Bo. Animate anyone 2: High-fidelity character image animation with environment affordance. arXiv preprint arXiv:2502.06145, 2025.

[11] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Reco...

[12] Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Ying-Cong Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Vbench++: Comprehensive and versatile benchmark suite for video generative models. arXiv preprint arXiv:2411.13503, 2024.

[13] Yasamin Jafarian and Hyun Soo Park. Learning high fidelity depths of dressed humans by watching social media dance videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12753–12762, 2021.

[14] Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in-one video creation and editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17191–17202, 2025.

[15] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.

[16] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Junkun Yuan, Yanxin Long, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Kai Wang, Mengyang Liu, Pengyu Li, Shuai Li, ... HunyuanVideo: A Systematic Framework For Large Video Generative Models.

[17] Yuxuan Luo, Zhengkun Rong, Lizhen Wang, Longhao Zhang, and Tianshu Hu. Dreamactor-m1: Holistic, expressive and robust human image animation with hybrid guidance. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11036–11046, 2025.

[18] Yifang Men, Yuan Yao, Miaomiao Cui, and Liefeng Bo. MIMO: controllable character video synthesis with spatial decomposed modeling. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025, pages 21181–21191. Computer Vision Foundation / IEEE, 2025.

[19] William Peebles and Saining Xie. Scalable diffusion models with transformers. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pages 4172–4182. IEEE, 2023.

[20] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 10674–10685. IEEE, 2022.

[21] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015 - 18th International Conference Munich, Germany, October 5 - 9, 2015, Proceedings, Part III, pages 234–241. Springer, 2015.

[22] Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. First order motion model for image animation. Advances in neural information processing systems, 32, 2019.

[23] Aliaksandr Siarohin, Oliver J Woodford, Jian Ren, Menglei Chai, and Sergey Tulyakov. Motion representations for articulated animation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13653–13662, 2021.

[24] Guoxian Song, Hongyi Xu, Xiaochen Zhao, You Xie, Tianpei Gu, Zenan Li, Chenxu Zhang, and Linjie Luo. X-unimotion: Animating human images with expressive, unified and identity-agnostic motion latents. CoRR, abs/2508.09383, 2025.

[25] Shuai Tan, Biao Gong, Xiang Wang, Shiwei Zhang, Dandan Zheng, Ruobing Zheng, Kecheng Zheng, Jingdong Chen, and Ming Yang. Animate-x: Universal character image animation with enhanced motion representation. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025.

[26] Shuyuan Tu, Zhen Xing, Xintong Han, Zhi-Qi Cheng, Qi Dai, Chong Luo, and Zuxuan Wu. Stableanimator: High-quality identity-preserving human image animation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21096–21106, 2025.

[27] Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018.

[28] Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Xiaofeng Meng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, Tianxing Wa... Wan: Open and Advanced Large-Scale Video Generative Models.

[29] Tan Wang, Linjie Li, Kevin Lin, Yuanhao Zhai, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, and Lijuan Wang. Disco: Disentangled control for realistic human dance generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9326–9336, 2024.

[30] Xiang Wang, Shiwei Zhang, Longxiang Tang, Yingya Zhang, Changxin Gao, Yuehuan Wang, and Nong Sang. Unianimate-dit: Human image animation with large-scale video diffusion transformer. CoRR, abs/2504.11289, 2025.

[31] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.

[32] Zhenzhi Wang, Yixuan Li, Yanhong Zeng, Youqing Fang, Yuwei Guo, Wenran Liu, Jing Tan, Kai Chen, Tianfan Xue, Bo Dai, and Dahua Lin. Humanvid: Demystifying training data for camera-controllable human image animation. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, V...

[33] Shuolin Xu, Siming Zheng, Ziyi Wang, HC Yu, Jinwei Chen, Huaqi Zhang, Bo Li, and Peng-Tao Jiang. Hypermotion: Dit-based pose-guided human image animation of complex motions. CoRR, abs/2505.22977, 2025.

[34] Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Hanshu Yan, Jia-Wei Liu, Chenxu Zhang, Jiashi Feng, and Mike Zheng Shou. Magicanimate: Temporally consistent human image animation using diffusion model. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 1481–

[35] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Yuxuan Zhang, Weihan Wang, Yean Cheng, Bin Xu, Xiaotao Gu, Yuxiao Dong, and Jie Tang. Cogvideox: Text-to-video diffusion models with an expert transformer. In The Thirteenth International Conference on Learning Repr...

[36] Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G. Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, and Lu Jiang. MAGVIT: masked generative video transformer. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 10459–10469. IEEE, 2023.

[37] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pages 3813–

[38] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.

[39] Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qing, Xiang Wang, Deli Zhao, and Jingren Zhou. I2vgen-xl: High-quality image-to-video synthesis via cascaded diffusion models. CoRR, abs/2311.04145,

[40] Shiyi Zhang, Junhao Zhuang, Zhaoyang Zhang, Ying Shan, and Yansong Tang. Flexiact: Towards flexible action control in heterogeneous scenarios. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–11, 2025.

[41] Yuang Zhang, Jiaxi Gu, Li-Wen Wang, Han Wang, Junqi Cheng, Yuefeng Zhu, and Fangyuan Zou. Mimicmotion: High-quality human motion video generation with confidence-aware pose guidance. arXiv preprint arXiv:2406.19680, 2024.

[42] Jian Zhao and Hui Zhang. Thin-plate spline motion model for image animation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3657–3666, 2022.

[43] Jingkai Zhou, Benzhi Wang, Weihua Chen, Jingqi Bai, Dongyang Li, Aixi Zhang, Hao Xu, Mingyang Yang, and Fan Wang. Realisdance: Equip controllable character animation with realistic hands. arXiv preprint arXiv:2409.06202,

[44] Jingkai Zhou, Yifan Wu, Shikai Li, Min Wei, Chao Fan, Weihua Chen, Wei Jiang, and Fan Wang. Realisdance-dit: Simple yet strong baseline towards controllable character animation in the wild. CoRR, abs/2504.14977, 2025.

[45] Shenhao Zhu, Junming Leo Chen, Zuozhuo Dai, Zilong Dong, Yinghui Xu, Xun Cao, Yao Yao, Hao Zhu, and Siyu Zhu. Champ: Controllable and consistent human image animation with 3d parametric guidance. In European Conference on Computer Vision, pages 145–162. Springer, 2024.

[46] X-Dance. Standard benchmarks, such as TikTok and RealisDance, source both the reference image and pose sequence from the same video. This idealized setup fails to reflect the spatio-temporal misalignment challenges prevalent in real-world applications. As shown in Fig. 14, to more robustly evaluate the model’s generalization capabilities in such scenar-...

[47] Model Details. 7.1. Motion Discontinuity Mitigation. As discussed in the main text, to address the abrupt transition between the reference frame and the initial pose frame, we propose Pose Simulation to explicitly replicate this discontinuity within the training data. Specifically, given a smooth training sequence {p_0, p_1, ..., p_T}, we first constr...

[48] Limitation and Future Work. Despite the promising results achieved by SteadyDancer in harmonized and coherent animation, several limitations remain to be addressed. 1) Domain Gap in Stylized Images. While our model delivers visually pleasing and coherent results for anime reference frames, its performance remains slightly inferior to the exceptional fi...