FreeAnimate: Training-Free Human Image Animation with Preview-Guided Denoising

Qingmin Liao; Yuan Zeng; Yujia Shi; Zongqing Lu

arxiv: 2606.06885 · v1 · pith:D73BXQKPnew · submitted 2026-06-05 · 💻 cs.CV · cs.AI


Yuan Zeng, Yujia Shi, Zongqing Lu, QingMin Liao

Pith reviewed 2026-06-27 22:54 UTC · model grok-4.3


keywords: human image animation · training-free method · diffusion models · preview generation · temporal consistency · identity preservation · attention modules


The pith

A preview generation strategy plus two attention modules lets pre-trained diffusion models animate human images without any training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that human image animation can be performed at high quality using only existing image diffusion models by generating preview frames first to supply pose and background guidance during denoising. Two new attention modules are introduced to enforce frame-to-frame consistency and keep the subject's identity intact. The authors argue this removes the need for large training datasets and compute while still matching or exceeding the results of methods that do require training, and that it works across varied input images and datasets.

Core claim

FreeAnimate produces temporally consistent, identity-preserving human animations by first creating preview frames that serve as structural and temporal priors for the denoising process, then applying Inversion-Boosted Attention to stabilize motion across frames and Reference-Anchored Self-Attention to lock the reference identity, all within an unmodified pre-trained diffusion model.

What carries the argument

The preview generation strategy that supplies temporal and structural priors to guide denoising, combined with Inversion-Boosted Attention for temporal consistency and Reference-Anchored Self-Attention for identity preservation.
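To make the identity-anchoring idea concrete, here is a minimal sketch, not the paper's implementation. It assumes that Reference-Anchored Self-Attention behaves like the mutual self-attention used in tuning-free editing work such as MasaCtrl, where each frame's queries attend to keys and values from the reference image as well as its own. The function name, tensor shapes, and the concatenation scheme are assumptions for illustration; the abstract does not specify them.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def reference_anchored_attention(q_frame, k_frame, v_frame, k_ref, v_ref):
    """Hypothetical sketch: one frame's self-attention with reference tokens appended.

    q_frame, k_frame, v_frame: (T, d) tokens of the frame being denoised.
    k_ref, v_ref: (R, d) keys/values taken from the reference image.
    """
    # Assumption: anchoring is done by extending the key/value set.
    k = np.concatenate([k_frame, k_ref], axis=0)
    v = np.concatenate([v_frame, v_ref], axis=0)
    scores = q_frame @ k.T / np.sqrt(q_frame.shape[-1])
    weights = softmax(scores, axis=-1)  # (T, T + R)
    return weights @ v, weights

rng = np.random.default_rng(0)
T, R, d = 4, 6, 8
q, k, v = (rng.normal(size=(T, d)) for _ in range(3))
k_ref, v_ref = rng.normal(size=(R, d)), rng.normal(size=(R, d))
out, weights = reference_anchored_attention(q, k, v, k_ref, v_ref)
```

No weights are changed in this sketch; only the set of tokens each query can see is widened, which is what lets a scheme of this kind run on an unmodified pre-trained model.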

If this is right

Standard image diffusion models can be used directly for video-like animation tasks without domain-specific training.
Generalization across datasets becomes feasible because no dataset-specific fine-tuning is required.
Background stability and pose alignment can be controlled through generated priors rather than learned temporal layers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar preview-based guidance might extend to other generation tasks where temporal structure is needed but training data is scarce.
If preview quality improves with better base models, the overall animation quality would rise without changes to the FreeAnimate pipeline.
The approach implies that explicit structural priors can substitute for implicit learning of motion in some controlled settings.

Load-bearing premise

The preview frames supply accurate enough pose, motion, and background information to steer the rest of the denoising steps without any model updates or extra training data.

What would settle it

Running the method on a held-out set of diverse human poses and backgrounds and finding that motion artifacts or identity drift appear at rates comparable to or worse than prior training-free baselines would falsify the claim that the preview priors are sufficient.
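One way such a test could be scored is sketched below. It assumes each frame and the reference image have already been mapped to a feature vector by some embedding model, which is hypothetical here and not named in the source. The two functions and their names are illustrative, not metrics reported by the paper.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two 1-D feature vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def identity_drift(ref_feat, frame_feats):
    # Per-frame distance from the reference identity (0 means identical direction).
    return [1.0 - cosine(ref_feat, f) for f in frame_feats]

def temporal_consistency(frame_feats):
    # Mean similarity between consecutive frames (higher means smoother).
    sims = [cosine(a, b) for a, b in zip(frame_feats[:-1], frame_feats[1:])]
    return float(np.mean(sims))

# Toy features: the first frame matches the reference, the second is orthogonal to it.
ref = np.array([1.0, 0.0, 0.0])
frames = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
drift = identity_drift(ref, frames)
consistency = temporal_consistency(frames)
```

Comparing these numbers across methods on the same held-out clips would give a rough, embedding-dependent view of the drift and artifact rates the test refers to.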

Original abstract

Human Image Animation has seen significant advancements, primarily driven by diffusion models. However, existing methods typically demand substantial training data and resources to achieve high-quality results, limiting generalization and accessibility. In this work, we introduce FreeAnimate, a training-free framework that leverages the inherent capabilities of image diffusion models to enable temporal consistency, identity preservation, and background stability. Our approach incorporates a novel preview generation strategy that provides temporal and structural priors from generated preview frames, effectively guiding pose alignment and background consistency without training. Additionally, FreeAnimate introduces Inversion-Boosted Attention and Reference-Anchored Self-Attention modules to guarantee temporal consistency and identity preservation. Experimental results demonstrate that FreeAnimate outperforms existing training-free competitors and training-based baseline methods, achieving generation quality comparable to state-of-the-art methods and offering robust generalization across diverse datasets. Our project page is at https://freeani.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FreeAnimate claims a training-free human animation pipeline via preview-guided denoising plus two attention modules, but the abstract's performance assertions have no visible numbers or ablations to evaluate.


The main point for you is that this paper puts forward a training-free route to human image animation in diffusion models. It generates preview frames first to supply temporal and structural guidance, then applies Inversion-Boosted Attention and Reference-Anchored Self-Attention to enforce consistency and identity without any fine-tuning.

What is actually new is the particular pairing of the preview strategy with those two attention modules. The work does a reasonable job spelling out the practical problem—most prior animation methods need heavy training data and compute, which hurts generalization—and it sketches a way around that using only the base model's inversion and attention controls.

The soft spot is the evidence. The abstract states that the method beats existing training-free competitors and training-based baselines, reaches quality comparable to state-of-the-art methods, and generalizes well across datasets. Yet it supplies none of the usual supporting material: no quantitative metrics, no named baselines, no ablation tables, and no error analysis. That leaves the central assumption untested in the summary we have, namely that preview frames plus the attention tweaks reliably deliver temporal consistency and identity preservation on arbitrary inputs. The method description itself lines up with known diffusion inversion and attention techniques, so there is no obvious internal contradiction.

This paper is aimed at researchers working on diffusion-based video or animation who want lower training costs. A reader already following training-free diffusion control papers would get the most out of it, provided the full experiments check out. It shows straightforward engagement with the literature and a clear goal.

I would send it to peer review so the results and implementation details can be examined directly.

Referee Report

2 major / 2 minor

Summary. The paper introduces FreeAnimate, a training-free framework for human image animation based on image diffusion models. It proposes a preview generation strategy to supply temporal and structural priors for pose alignment and background consistency, along with two attention modules (Inversion-Boosted Attention and Reference-Anchored Self-Attention) to enforce temporal consistency and identity preservation. The central claim is that this pipeline outperforms existing training-free competitors, matches or exceeds some training-based baselines, achieves quality comparable to state-of-the-art methods, and generalizes robustly across diverse datasets without any training.

Significance. If the experimental claims are substantiated, the work would be significant for demonstrating that training-free techniques can achieve competitive results in human animation, thereby reducing dependence on large annotated datasets and computational resources for fine-tuning. This aligns with broader efforts in diffusion-based generative modeling to improve accessibility and generalization.

major comments (2)

[Abstract] The assertion that 'Experimental results demonstrate that FreeAnimate outperforms existing training-free competitors and training-based baseline methods, achieving generation quality comparable to state-of-the-art methods' is load-bearing for the central claim but supplies no quantitative metrics, baseline names, ablation details, or error analysis, making it impossible to evaluate whether the data support the stated superiority and generalization.
[§3 (Method)] The weakest assumption—that the preview generation strategy plus the two attention modules reliably enforce consistency and identity preservation on arbitrary inputs without training—requires explicit validation; without ablations isolating each component's contribution (e.g., with/without preview guidance or each attention module), the internal consistency of the pipeline cannot be assessed.

minor comments (2)

The project page URL is provided, which aids reproducibility; consider also releasing code or detailed prompts used in the preview generation.
[§3.2] Clarify the exact inversion technique referenced in 'Inversion-Boosted Attention' and how it differs from standard DDIM inversion, as this terminology may be unclear to readers unfamiliar with the specific implementation.
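For readers who want the baseline that the last comment refers to, the following is a minimal sketch of standard deterministic DDIM inversion (eta = 0), not of Inversion-Boosted Attention itself. The noise predictor `eps_fn` and the schedule `alphas` are toy stand-ins and are assumptions; a real pipeline would call the pre-trained denoiser at each step.

```python
import numpy as np

def ddim_step(x, eps, a_from, a_to):
    # Deterministic DDIM update between cumulative-alpha levels a_from and a_to.
    x0_pred = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_pred + np.sqrt(1.0 - a_to) * eps

def ddim_invert(x0, eps_fn, alphas):
    # alphas runs from nearly clean (close to 1) to noisy (close to 0).
    x, traj = x0, [x0]
    for a_from, a_to in zip(alphas[:-1], alphas[1:]):
        x = ddim_step(x, eps_fn(x, a_from), a_from, a_to)
        traj.append(x)
    return traj

def ddim_sample(x_noisy, eps_fn, alphas):
    # Walk the same schedule in the opposite direction.
    x, rev = x_noisy, alphas[::-1]
    for a_from, a_to in zip(rev[:-1], rev[1:]):
        x = ddim_step(x, eps_fn(x, a_from), a_from, a_to)
    return x

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 4))
fixed_eps = rng.normal(size=(4, 4))
eps_fn = lambda x, a: fixed_eps          # toy predictor: same noise everywhere
alphas = np.linspace(0.999, 0.05, 10)    # toy schedule
latents = ddim_invert(x0, eps_fn, alphas)
recon = ddim_sample(latents[-1], eps_fn, alphas)
```

With the constant toy predictor the round trip is exact. With a real network, inversion evaluates the noise at the current latent while sampling evaluates it at the next one, so reconstruction is only approximate; that gap is the kind of detail the comment asks the authors to spell out.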

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and outline the revisions we will make.


Referee: [Abstract] The assertion that 'Experimental results demonstrate that FreeAnimate outperforms existing training-free competitors and training-based baseline methods, achieving generation quality comparable to state-of-the-art methods' is load-bearing for the central claim but supplies no quantitative metrics, baseline names, ablation details, or error analysis, making it impossible to evaluate whether the data support the stated superiority and generalization.

Authors: We agree that the abstract statement is too high-level and lacks supporting specifics. In the revised manuscript we will update the abstract to include concrete quantitative metrics from our experiments (such as FID, FVD, and temporal consistency scores), explicitly name the training-free and training-based baselines used for comparison, and briefly reference the ablation results that underpin the claims of superiority and generalization. Revision: yes.

Referee: [§3 (Method)] The weakest assumption—that the preview generation strategy plus the two attention modules reliably enforce consistency and identity preservation on arbitrary inputs without training—requires explicit validation; without ablations isolating each component's contribution (e.g., with/without preview guidance or each attention module), the internal consistency of the pipeline cannot be assessed.

Authors: We acknowledge the need for explicit component-wise validation. While the current manuscript presents the overall pipeline and comparative results, we will add dedicated ablation studies in the revised version. These will isolate the preview generation strategy (with/without) as well as each attention module individually, reporting quantitative effects on temporal consistency and identity preservation across diverse, arbitrary inputs. Revision: yes.
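The component-wise ablation promised above can be laid out as a simple on/off grid. The sketch below only enumerates configurations; the component keys are labels chosen for illustration, and no experiment runner or metric from the paper is implied.

```python
from itertools import product

# Illustrative labels for the three components named in the abstract.
COMPONENTS = (
    "preview_guidance",
    "inversion_boosted_attention",
    "reference_anchored_self_attention",
)

def ablation_configs(components=COMPONENTS):
    # Every on/off combination, starting with the full pipeline (all True).
    return [
        dict(zip(components, flags))
        for flags in product((True, False), repeat=len(components))
    ]

configs = ablation_configs()
# Leave-one-out runs: exactly one component disabled.
leave_one_out = [c for c in configs if sum(c.values()) == len(COMPONENTS) - 1]
```

The three leave-one-out runs are the minimum needed to attribute an effect to each component; the full grid of eight also exposes interactions between them.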

Circularity Check

0 steps flagged

No significant circularity detected


The paper introduces a training-free animation pipeline relying on preview generation plus two attention modules applied to off-the-shelf diffusion models. All load-bearing claims are presented as resting on direct experimental comparison against external baselines rather than on any derivation, fitted parameter, or self-referential definition. No equations appear in the supplied text, no self-citation chain is invoked to justify uniqueness, and the method description is consistent with standard diffusion inversion and attention-control techniques already present in the literature. The claims can therefore be checked against external benchmarks and are not circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be extracted or audited from the provided text.

pith-pipeline@v0.9.1-grok · 5685 in / 1007 out tokens · 22914 ms · 2026-06-27T22:54:41.907789+00:00 · methodology


Reference graph

Works this paper leans on

35 extracted references · 4 linked inside Pith

[1]

HIA [1, 2, 3] holds strong potential across various fields, including social media, entertainment industry, and video game development

INTRODUCTION The development of diffusion models has greatly advanced applications in content generation, with human-centered themes like Human Image Animation (HIA) receiving particular attention. HIA [1, 2, 3] holds strong potential across various fields, including social media, entertainment industry, and video game development. Initial attempts wi...

Pith/arXiv arXiv 2026
[2]

$\ldots, p_N] \in \mathbb{R}^{N \times H \times W \times 3}$, where $N$ is the number of frames

METHOD Given a reference image $I_{\mathrm{ref}} \in \mathbb{R}^{H \times W \times 3}$ that contains an identity, and a pose sequence $p_{1:N} = [p_1, \ldots, p_N] \in \mathbb{R}^{N \times H \times W \times 3}$, where $N$ is the number of frames. Human image animation aims to generate a temporally coherent video $I_{1:N} = [I_1, \ldots, I_N]$ in which both identity and background appearance follow the reference image, while the identity pose...
[3]

MRAA [1] and TPS [2] are state-of-the-art GAN-based methods

EXPERIMENTS Baselines and Benchmarks. We chose a diverse set of state-of-the-art HIA methods for comparison. MRAA [1] and TPS [2] are state-of-the-art GAN-based methods. Diffusion-based methods are grouped by training data: (1) DisCo [3], MagicPose [9], MagicAnimate [8], and AnimateAnyone [10] use only public datasets; (2) Champ [12], MimicMotion [11], ...
[4]

Implementation Details. We use DWPose [29] for pose extraction and StableAnimator’s [14] algorithm for alignment

consists of full-body videos of five subjects and is used to assess the generalization of our method to full-body motions. Implementation Details. We use DWPose [29] for pose extraction and StableAnimator’s [14] algorithm for alignment. The first frame is the reference, and others are driving frames, resized to 512 × 512. The DDIM sampler is configured wit...
[5]

CONCLUSION We present FreeAnimate, a training-free framework for HIA, addressing limitations associated with data- and resource-intensive HIA methods. By integrating a Preview Generation Strategy with a training-free model architecture, FreeAnimate achieves high fidelity in pose-guided human image animation without relying on extensive training datasets...
[6]

ACKNOWLEDGEMENTS This work is supported by the National Natural Science Foundation of China (U23B2030)
[7]

Motion representations for articulated animation,

Aliaksandr Siarohin, Oliver J Woodford, Jian Ren, Menglei Chai, and Sergey Tulyakov, “Motion representations for articulated animation,” in CVPR, 2021, pp. 13653–13662

2021
[8]

Thin-plate spline motion model for image animation,

Jian Zhao and Hui Zhang, “Thin-plate spline motion model for image animation,” in CVPR, 2022, pp. 3657–3666

2022
[9]

Disco: Disentangled control for realistic human dance generation,

Tan Wang, Linjie Li, Kevin Lin, Yuanhao Zhai, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, and Lijuan Wang, “Disco: Disentangled control for realistic human dance generation,” in CVPR, 2024, pp. 9326–9336

2024
[10]

Generative adversarial networks,

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, “Generative adversarial networks,” Communications of the ACM, vol. 63, no. 11, pp. 139–144, 2020

2020
[11]

Denoising diffusion probabilistic models,

Jonathan Ho, Ajay Jain, and Pieter Abbeel, “Denoising diffusion probabilistic models,” NeurIPS, vol. 33, pp. 6840–6851, 2020

2020
[12]

Denoising diffusion implicit models,

Jiaming Song, Chenlin Meng, and Stefano Ermon, “Denoising diffusion implicit models,” arXiv preprint arXiv:2010.02502, 2020

Pith/arXiv arXiv 2010
[13]

Adding conditional control to text-to-image diffusion models,

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala, “Adding conditional control to text-to-image diffusion models,” in ICCV, 2023, pp. 3836–3847

2023
[14]

Magicanimate: Temporally consistent human image animation using diffusion model,

Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Hanshu Yan, Jia-Wei Liu, Chenxu Zhang, Jiashi Feng, and Mike Zheng Shou, “Magicanimate: Temporally consistent human image animation using diffusion model,” in CVPR, 2024, pp. 1481–1490

2024
[15]

Magicpose: Realistic human poses and facial expressions retargeting with identity-aware diffusion,

Di Chang, Yichun Shi, Quankai Gao, Hongyi Xu, Jessica Fu, Guoxian Song, Qing Yan, Yizhe Zhu, Xiao Yang, and Mohammad Soleymani, “Magicpose: Realistic human poses and facial expressions retargeting with identity-aware diffusion,” in ICML, 2023

2023
[16]

Animate anyone: Consistent and controllable image-to-video synthesis for character animation,

Li Hu, “Animate anyone: Consistent and controllable image-to-video synthesis for character animation,” in CVPR, 2024, pp. 8153–8163

2024
[17]

Mimicmotion: High-quality human motion video generation with confidence-aware pose guidance,

Yuang Zhang, Jiaxi Gu, Li-Wen Wang, Han Wang, Junqi Cheng, Yuefeng Zhu, and Fangyuan Zou, “Mimicmotion: High-quality human motion video generation with confidence-aware pose guidance,” arXiv preprint arXiv:2406.19680, 2024

arXiv 2024
[18]

Champ: Controllable and consistent human image animation with 3d parametric guidance,

Shenhao Zhu, Junming Leo Chen, Zuozhuo Dai, Yinghui Xu, Xun Cao, Yao Yao, Hao Zhu, and Siyu Zhu, “Champ: Controllable and consistent human image animation with 3d parametric guidance,” arXiv preprint arXiv:2403.14781, 2024

arXiv 2024
[19]

Smpl: A skinned multi-person linear model,

Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black, “Smpl: A skinned multi-person linear model,” in Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pp. 851–866. 2023

2023
[20]

Stableanimator: High-quality identity-preserving human image animation,

Shuyuan Tu, Zhen Xing, Xintong Han, Zhi-Qi Cheng, Qi Dai, Chong Luo, and Zuxuan Wu, “Stableanimator: High-quality identity-preserving human image animation,” arXiv preprint arXiv:2411.17697, 2024

arXiv 2024
[21]

Zero-shot high-fidelity and pose-controllable character animation,

Bingwen Zhu, Fanyi Wang, Tianyi Lu, Peng Liu, Jingwen Su, Jinxiu Liu, Yanhao Zhang, Zuxuan Wu, Guo-Jun Qi, and Yu-Gang Jiang, “Zero-shot high-fidelity and pose-controllable character animation,” arXiv preprint arXiv:2404.13680, 2024

arXiv 2024
[22]

Animatediff: Animate your personalized text-to-image diffusion models without specific tuning,

Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai, “Animatediff: Animate your personalized text-to-image diffusion models without specific tuning,” ICLR, 2024

2024
[23]

Grounded sam: Assembling open-world models for diverse visual tasks,

Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al., “Grounded sam: Assembling open-world models for diverse visual tasks,” arXiv preprint arXiv:2401.14159, 2024

Pith/arXiv arXiv 2024
[24]

Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing,

Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng, “Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing,” in ICCV, 2023, pp. 22560–22570

2023
[25]

T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models,

Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, and Ying Shan, “T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models,” in AAAI, 2024, vol. 38, pp. 4296–4304

2024
[26]

Mat: Mask-aware transformer for large hole image inpainting,

Wenbo Li, Zhe Lin, Kun Zhou, Lu Qi, Yi Wang, and Jiaya Jia, “Mat: Mask-aware transformer for large hole image inpainting,” in CVPR, 2022, pp. 10758–10768

2022
[27]

High-resolution image synthesis with latent diffusion models,

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer, “High-resolution image synthesis with latent diffusion models,” in CVPR, 2022, pp. 10684–10695

2022
[28]

Follow-your-pose v2: Multiple-condition guided character image animation for stable pose control,

Jingyun Xue, Hongfa Wang, Qi Tian, Yue Ma, Andong Wang, Zhiyuan Zhao, Shaobo Min, Wenzhe Zhao, Kaihao Zhang, Heung-Yeung Shum, Wei Liu, Mengyang Liu, and Wenhan Luo, “Follow-your-pose v2: Multiple-condition guided character image animation for stable pose control,” 2024

2024
[29]

Ablation study: Why controlnets use deep encoder? what if it was lighter? or even an mlp?,

Lyumin Zhang, “Ablation study: Why controlnets use deep encoder? what if it was lighter? or even an mlp?,” https://github.com/lllyasviel/ControlNet/discussions/188, Accessed: Feb. 28, 2023

2023
[30]

Fatezero: Fusing attentions for zero-shot text-based video editing,

Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen, “Fatezero: Fusing attentions for zero-shot text-based video editing,” in ICCV, 2023, pp. 15932–15942

2023
[31]

Prompt-to-prompt image editing with cross attention control,

Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or, “Prompt-to-prompt image editing with cross attention control,” arXiv preprint arXiv:2208.01626, 2022

Pith/arXiv arXiv 2022
[32]

Pix2video: Video editing using image diffusion,

Duygu Ceylan, Chun-Hao P Huang, and Niloy J Mitra, “Pix2video: Video editing using image diffusion,” in ICCV, 2023, pp. 23206–23217

2023
[33]

Learning high fidelity depths of dressed humans by watching social media dance videos,

Yasamin Jafarian and Hyun Soo Park, “Learning high fidelity depths of dressed humans by watching social media dance videos,” in CVPR, 2021, pp. 12753–12762

2021
[34]

Everybody dance now,

Caroline Chan, Shiry Ginosar, Tinghui Zhou, and Alexei A Efros, “Everybody dance now,” in ICCV, 2019, pp. 5933–5942

2019
[35]

Effective whole-body pose estimation with two-stages distillation,

Zhendong Yang, Ailing Zeng, Chun Yuan, and Yu Li, “Effective whole-body pose estimation with two-stages distillation,” in ICCV, 2023, pp. 4210–4220

2023
