{"work":{"id":"1f9d1d3b-a6d6-45a9-9f13-51393c03be8a","openalex_id":null,"doi":null,"arxiv_id":"2307.04725","raw_key":null,"title":"AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning","authors":null,"authors_text":"Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao","year":2023,"venue":"cs.CV","abstract":"With the advance of text-to-image (T2I) diffusion models (e.g., Stable Diffusion) and corresponding personalization techniques such as DreamBooth and LoRA, everyone can manifest their imagination into high-quality images at an affordable cost. However, adding motion dynamics to existing high-quality personalized T2Is and enabling them to generate animations remains an open challenge. In this paper, we present AnimateDiff, a practical framework for animating personalized T2I models without requiring model-specific tuning. At the core of our framework is a plug-and-play motion module that can be trained once and seamlessly integrated into any personalized T2Is originating from the same base T2I. Through our proposed training strategy, the motion module effectively learns transferable motion priors from real-world videos. Once trained, the motion module can be inserted into a personalized T2I model to form a personalized animation generator. We further propose MotionLoRA, a lightweight fine-tuning technique for AnimateDiff that enables a pre-trained motion module to adapt to new motion patterns, such as different shot types, at a low training and data collection cost. We evaluate AnimateDiff and MotionLoRA on several public representative personalized T2I models collected from the community. The results demonstrate that our approaches help these models generate temporally smooth animation clips while preserving the visual quality and motion diversity. 
Codes and pre-trained weights are available at https://github.com/guoyww/AnimateDiff.","external_url":"https://arxiv.org/abs/2307.04725","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-15T01:53:28.922294+00:00","pith_arxiv_id":"2307.04725","created_at":"2026-05-09T06:15:38.154562+00:00","updated_at":"2026-05-15T01:53:28.922294+00:00","title_quality_ok":true,"display_title":"AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning","render_title":"AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning"},"hub":{"state":{"work_id":"1f9d1d3b-a6d6-45a9-9f13-51393c03be8a","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":50,"external_cited_by_count":null,"distinct_field_count":2,"first_pith_cited_at":"2023-10-30T13:12:40+00:00","last_pith_cited_at":"2026-05-14T08:39:42+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-15T02:26:20.446258+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":3},{"context_role":"baseline","n":1},{"context_role":"method","n":1}],"polarity_counts":[{"context_polarity":"background","n":3},{"context_polarity":"baseline","n":1},{"context_polarity":"use_method","n":1}],"runs":{"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T14:51:36.442323+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets","work_id":"4f68eada-27e3-437a-a2fe-6e4ca524d0d3","shared_citers":28},{"title":"Wan: Open and Advanced Large-Scale Video Generative Models","work_id":"ad3ebc3b-4224-46c9-b61d-bcf135da0a7c","shared_citers":28},{"title":"HunyuanVideo: A Systematic Framework For Large Video Generative Models","work_id":"881efa7e-7e73-4c66-9cc3-2803e551061c","shared_citers":22},{"title":"CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer","work_id":"f38fc088-12aa-4bf4-9ecd-08d3e797ccb7","shared_citers":19},{"title":"SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis","work_id":"8034c587-fba6-4941-87ba-c98f2ac962cb","shared_citers":14},{"title":"Make-A-Video: Text-to-Video Generation without Text-Video Data","work_id":"52a801fc-a707-45a1-a8cd-0d6702f124ab","shared_citers":13},{"title":"Imagen Video: High Definition Video Generation with Diffusion Models","work_id":"bb20d241-dc6f-4b0a-b071-fd43a2cbd57f","shared_citers":10},{"title":"Auto-Encoding Variational Bayes","work_id":"97d95295-30e1-42b4-bbf6-85f0fa4edb44","shared_citers":9},{"title":"CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers","work_id":"2dbd6bcd-fc98-4fbf-b586-f6d94fe1abd2","shared_citers":9},{"title":"ModelScope Text-to-Video Technical Report","work_id":"1b1baf78-58ec-44d0-b700-84dff57b2f1f","shared_citers":9},{"title":"CameraCtrl: Enabling Camera Control for Text-to-Video Generation","work_id":"1c05c278-c023-4ef0-a359-25a41f1065eb","shared_citers":8},{"title":"DINOv2: Learning Robust Visual Features without Supervision","work_id":"26b304e5-b54a-4f26-be7e-83299eca52e4","shared_citers":8},{"title":"Flow Matching for Generative 
Modeling","work_id":"6edb71c4-5d64-40af-a394-9757ea051a36","shared_citers":8},{"title":"Magicvideo: Efficient video generation with latent diffusion models","work_id":"aad71b40-2721-438d-8e8c-97f84063ed39","shared_citers":8},{"title":"Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion","work_id":"53e58ef9-7932-4b83-b757-34ac14db3e0f","shared_citers":8},{"title":"Classifier-Free Diffusion Guidance","work_id":"acf2c588-c088-4a6c-938e-150ad7c666d7","shared_citers":7},{"title":"Denoising Diffusion Implicit Models","work_id":"8fa2128b-d18c-405c-ac92-0e669cf89ac0","shared_citers":7},{"title":"Latte: Latent Diffusion Transformer for Video Generation","work_id":"5328e907-7278-4781-a2bb-c5ef40dc87fb","shared_citers":7},{"title":"Score-Based Generative Modeling through Stochastic Differential Equations","work_id":"d9110e53-a5d4-4794-a4c5-a575e91c31ad","shared_citers":7},{"title":"VideoGPT: Video Generation using VQ-VAE and Transformers","work_id":"703c74c3-fa5e-455c-8c00-697c83511fcf","shared_citers":7},{"title":"Videopoet: A large language model for zero-shot video generation","work_id":"5cc3572d-7e2f-4431-ae42-d9282a42a800","shared_citers":7},{"title":"Latent video diffusion models for high-fidelity video generation with arbitrary lengths","work_id":"23338b3d-620a-4954-904f-bab6a577b8a5","shared_citers":6},{"title":"LTX-Video: Realtime Video Latent Diffusion","work_id":"cee5c521-3ce9-466e-a035-1e42f89254f4","shared_citers":6},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":6}],"time_series":[{"n":1,"year":2023},{"n":7,"year":2024},{"n":39,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T14:51:47.517510+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T14:51:38.955824+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning","claims":[{"claim_text":"With the advance of text-to-image (T2I) diffusion models (e.g., Stable Diffusion) and corresponding personalization techniques such as DreamBooth and LoRA, everyone can manifest their imagination into high-quality images at an affordable cost. However, adding motion dynamics to existing high-quality personalized T2Is and enabling them to generate animations remains an open challenge. In this paper, we present AnimateDiff, a practical framework for animating personalized T2I models without requiring model-specific tuning. 
At the core of our framework is a plug-and-play motion module that can be","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T14:51:47.520100+00:00"}},"summary":{"title":"AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning","claims":[{"claim_text":"With the advance of text-to-image (T2I) diffusion models (e.g., Stable Diffusion) and corresponding personalization techniques such as DreamBooth and LoRA, everyone can manifest their imagination into high-quality images at an affordable cost. However, adding motion dynamics to existing high-quality personalized T2Is and enabling them to generate animations remains an open challenge. In this paper, we present AnimateDiff, a practical framework for animating personalized T2I models without requiring model-specific tuning. At the core of our framework is a plug-and-play motion module that can be","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets","work_id":"4f68eada-27e3-437a-a2fe-6e4ca524d0d3","shared_citers":28},{"title":"Wan: Open and Advanced Large-Scale Video Generative Models","work_id":"ad3ebc3b-4224-46c9-b61d-bcf135da0a7c","shared_citers":28},{"title":"HunyuanVideo: A Systematic Framework For Large Video Generative Models","work_id":"881efa7e-7e73-4c66-9cc3-2803e551061c","shared_citers":22},{"title":"CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer","work_id":"f38fc088-12aa-4bf4-9ecd-08d3e797ccb7","shared_citers":19},{"title":"SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis","work_id":"8034c587-fba6-4941-87ba-c98f2ac962cb","shared_citers":14},{"title":"Make-A-Video: Text-to-Video Generation without Text-Video Data","work_id":"52a801fc-a707-45a1-a8cd-0d6702f124ab","shared_citers":13},{"title":"Imagen Video: High Definition Video Generation with Diffusion Models","work_id":"bb20d241-dc6f-4b0a-b071-fd43a2cbd57f","shared_citers":10},{"title":"Auto-Encoding Variational Bayes","work_id":"97d95295-30e1-42b4-bbf6-85f0fa4edb44","shared_citers":9},{"title":"CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers","work_id":"2dbd6bcd-fc98-4fbf-b586-f6d94fe1abd2","shared_citers":9},{"title":"ModelScope Text-to-Video Technical Report","work_id":"1b1baf78-58ec-44d0-b700-84dff57b2f1f","shared_citers":9},{"title":"CameraCtrl: Enabling Camera Control for Text-to-Video Generation","work_id":"1c05c278-c023-4ef0-a359-25a41f1065eb","shared_citers":8},{"title":"DINOv2: Learning Robust Visual Features without Supervision","work_id":"26b304e5-b54a-4f26-be7e-83299eca52e4","shared_citers":8},{"title":"Flow Matching for Generative Modeling","work_id":"6edb71c4-5d64-40af-a394-9757ea051a36","shared_citers":8},{"title":"Magicvideo: Efficient video generation with latent diffusion models","work_id":"aad71b40-2721-438d-8e8c-97f84063ed39","shared_citers":8},{"title":"Self Forcing: Bridging the Train-Test Gap in Autoregressive Video 
Diffusion","work_id":"53e58ef9-7932-4b83-b757-34ac14db3e0f","shared_citers":8},{"title":"Classifier-Free Diffusion Guidance","work_id":"acf2c588-c088-4a6c-938e-150ad7c666d7","shared_citers":7},{"title":"Denoising Diffusion Implicit Models","work_id":"8fa2128b-d18c-405c-ac92-0e669cf89ac0","shared_citers":7},{"title":"Latte: Latent Diffusion Transformer for Video Generation","work_id":"5328e907-7278-4781-a2bb-c5ef40dc87fb","shared_citers":7},{"title":"Score-Based Generative Modeling through Stochastic Differential Equations","work_id":"d9110e53-a5d4-4794-a4c5-a575e91c31ad","shared_citers":7},{"title":"VideoGPT: Video Generation using VQ-VAE and Transformers","work_id":"703c74c3-fa5e-455c-8c00-697c83511fcf","shared_citers":7},{"title":"Videopoet: A large language model for zero-shot video generation","work_id":"5cc3572d-7e2f-4431-ae42-d9282a42a800","shared_citers":7},{"title":"Latent video diffusion models for high-fidelity video generation with arbitrary lengths","work_id":"23338b3d-620a-4954-904f-bab6a577b8a5","shared_citers":6},{"title":"LTX-Video: Realtime Video Latent Diffusion","work_id":"cee5c521-3ce9-466e-a035-1e42f89254f4","shared_citers":6},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":6}],"time_series":[{"n":1,"year":2023},{"n":7,"year":2024},{"n":39,"year":2026}],"dependency_candidates":[]},"authors":[]}}