Goku provides a 2M-pair dataset for multi-task structural video editing, Goku-Edit model with MLLM and dual-branch design, and Goku-Bench yielding up to 8% gains in instruction following.
super hub Canonical reference
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
Canonical reference. 75% of citing Pith papers cite this work as background.
abstract
With the advance of text-to-image (T2I) diffusion models (e.g., Stable Diffusion) and corresponding personalization techniques such as DreamBooth and LoRA, everyone can manifest their imagination into high-quality images at an affordable cost. However, adding motion dynamics to existing high-quality personalized T2Is and enabling them to generate animations remains an open challenge. In this paper, we present AnimateDiff, a practical framework for animating personalized T2I models without requiring model-specific tuning. At the core of our framework is a plug-and-play motion module that can be trained once and seamlessly integrated into any personalized T2Is originating from the same base T2I. Through our proposed training strategy, the motion module effectively learns transferable motion priors from real-world videos. Once trained, the motion module can be inserted into a personalized T2I model to form a personalized animation generator. We further propose MotionLoRA, a lightweight fine-tuning technique for AnimateDiff that enables a pre-trained motion module to adapt to new motion patterns, such as different shot types, at a low training and data collection cost. We evaluate AnimateDiff and MotionLoRA on several public representative personalized T2I models collected from the community. The results demonstrate that our approaches help these models generate temporally smooth animation clips while preserving the visual quality and motion diversity. Codes and pre-trained weights are available at https://github.com/guoyww/AnimateDiff.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract With the advance of text-to-image (T2I) diffusion models (e.g., Stable Diffusion) and corresponding personalization techniques such as DreamBooth and LoRA, everyone can manifest their imagination into high-quality images at an affordable cost. However, adding motion dynamics to existing high-quality personalized T2Is and enabling them to generate animations remains an open challenge. In this paper, we present AnimateDiff, a practical framework for animating personalized T2I models without requiring model-specific tuning. At the core of our framework is a plug-and-play motion module that can be
authors
co-cited works
representative citing papers
Introduces PexelsCustom-1M dataset, CustoMDiT parameter-efficient model, and OpenCustom benchmark for open-domain customized video generation.
OmniTryOn performs multi-object video virtual try-on in one pass using first-frame wearable caching and spatiotemporal RoPE, outperforming single-garment baselines on a new TryAny-Bench dataset.
SPAWN enables training-free insertion of custom visual concepts into autoregressive world models by swapping the pinned context-memory anchor over a short injection window.
SafeGen-Bench is a benchmark with 10 malicious categories that evaluates conditional T2V models on paired start frames and text prompts, finding unsafety scores up to 44.5 and 80% guardrail failure rate.
MBench is a new benchmark that quantifies long-term memory in video world models via three hierarchical consistency dimensions evaluated on curated real videos.
YoCausal benchmark shows video diffusion models detect the arrow of time but lack genuine causal understanding relative to humans.
Formalizes video world models as group actions on states and uses latent regularization with synthesized supervision to enforce consistency, introducing GAC and GAR metrics that improve structural correctness in SOTA models.
CoMoGen generates controllable interactive video from mask sequences and images by encoding masks into MMDiT via MaskAdapter and LoRA on motion layers, claiming SOTA motion fidelity.
Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture with a new paired dataset.
Soap2Soap uses a multi-agent system with dual-bridge consistency via JSON screenplays and visual anchors plus batch keyframe generation to achieve better long-term consistency in cinematic video remaking than commercial APIs.
StreamingEffect enables real-time 720p human-centric video effect generation on one GPU via teacher-student distillation, keyframe control, and a new 130K video dataset.
DEVIS-GRPO applies online policy gradients with an accumulative small-to-large view sampling strategy and multi-level rewards to improve trajectory-controlled extreme view video generation, reporting gains on Kubric-4D, iPhone, and DL3DV datasets.
Echo-Forcing decouples stable anchors, compressed history, and recent dynamics in video diffusion KV caches using hierarchical memory, scene recall frames, and difference-aware decay to support interactive long video generation under bounded cache.
MechVerse benchmark shows current video generation models preserve appearance but fail at mechanically admissible motion, with errors rising as coupling complexity increases.
A training-free method reformulates camera control as geometric displacement fields applied via differentiable latent resampling, enabling control and bias probing in video diffusion models.
OmniNFT introduces modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting in an online diffusion RL framework to improve audio-video quality, alignment, and synchronization.
CaC presents a new spatiotemporal concentrating reward model for video anomalies, built on a novel large-scale dataset and three-stage training with RL and IoU rewards, claiming 25.7% accuracy gains and 11.7% anomaly reduction.
MuSS is a new movie-sourced dataset and benchmark that enables AI models to generate multi-shot videos with improved narrative coherence and subject identity preservation.
Flow of Truth is the first proactive temporal forensics framework for image-to-video generation that uses a learnable forensic template following pixel motion and a template-guided flow module to decouple motion from content.
OmniCamera disentangles video content and camera motion for multi-task generation with arbitrary camera control via the OmniCAM hybrid dataset and Dual-level Curriculum Co-Training.
HumANDiff improves motion consistency in human video generation by sampling diffusion noise on an articulated human body template and adding joint appearance-motion prediction plus a geometric consistency loss.
AvatarPointillist autoregressively generates adaptive 3D point clouds via Transformer for photorealistic 4D Gaussian avatars from one image, jointly predicting animation bindings and using a conditioned Gaussian decoder.
SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.
citing papers explorer
-
Goku: A Million-Scale Universal Dataset and Benchmark for Instruction-Based Video Editing
Goku provides a 2M-pair dataset for multi-task structural video editing, Goku-Edit model with MLLM and dual-branch design, and Goku-Bench yielding up to 8% gains in instruction following.
-
A Comprehensive Ecosystem for Open-Domain Customized Video Generation
Introduces PexelsCustom-1M dataset, CustoMDiT parameter-efficient model, and OpenCustom benchmark for open-domain customized video generation.
-
OmniTryOn: Video Try-On Anything at Once!
OmniTryOn performs multi-object video virtual try-on in one pass using first-frame wearable caching and spatiotemporal RoPE, outperforming single-garment baselines on a new TryAny-Bench dataset.
-
From Zero to Hero: Training-Free Custom Concept Spawning in World Models
SPAWN enables training-free insertion of custom visual concepts into autoregressive world models by swapping the pinned context-memory anchor over a short injection window.
-
SafeGen-Bench: Benchmarking Safety in Image-Conditioned Text-to-Video Generation
SafeGen-Bench is a benchmark with 10 malicious categories that evaluates conditional T2V models on paired start frames and text prompts, finding unsafety scores up to 44.5 and 80% guardrail failure rate.
-
MBench: A Comprehensive Benchmark on Memory Capability for Video World Models
MBench is a new benchmark that quantifies long-term memory in video world models via three hierarchical consistency dimensions evaluated on curated real videos.
-
YoCausal: How Far is Video Generation from World Model? A Causality Perspective
YoCausal benchmark shows video diffusion models detect the arrow of time but lack genuine causal understanding relative to humans.
-
World Models as Group Actions
Formalizes video world models as group actions on states and uses latent regularization with synthesized supervision to enforce consistency, introducing GAC and GAR metrics that improve structural correctness in SOTA models.
-
CoMoGen: COntrollable MOtion Dynamics and Interactions with Mask-Guided Video GENeration
CoMoGen generates controllable interactive video from mask sequences and images by encoding masks into MMDiT via MaskAdapter and LoRA on motion layers, claiming SOTA motion fidelity.
-
Functionalization via Structure Completion and Motion Rectification
Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture with a new paired dataset.
-
Soap2Soap: Long Cinematic Video Remaking via Multi-Agent Collaboration
Soap2Soap uses a multi-agent system with dual-bridge consistency via JSON screenplays and visual anchors plus batch keyframe generation to achieve better long-term consistency in cinematic video remaking than commercial APIs.
-
StreamingEffect: Real-Time Human-Centric Video Effect Generation
StreamingEffect enables real-time 720p human-centric video effect generation on one GPU via teacher-student distillation, keyframe control, and a new 130K video dataset.
-
DEVIS-GRPO: Unleashing GRPO on Dynamic Extreme View Synthesis
DEVIS-GRPO applies online policy gradients with an accumulative small-to-large view sampling strategy and multi-level rewards to improve trajectory-controlled extreme view video generation, reporting gains on Kubric-4D, iPhone, and DL3DV datasets.
-
Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation
Echo-Forcing decouples stable anchors, compressed history, and recent dynamics in video diffusion KV caches using hierarchical memory, scene recall frames, and difference-aware decay to support interactive long video generation under bounded cache.
-
MechVerse: Evaluating Physical Motion Consistency in Video Generation Models
MechVerse benchmark shows current video generation models preserve appearance but fail at mechanically admissible motion, with errors rising as coupling complexity increases.
-
Probing into Camera Control of Video Models
A training-free method reformulates camera control as geometric displacement fields applied via differentiable latent resampling, enabling control and bias probing in video diffusion models.
-
OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation
OmniNFT introduces modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting in an online diffusion RL framework to improve audio-video quality, alignment, and synchronization.
-
CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating
CaC presents a new spatiotemporal concentrating reward model for video anomalies, built on a novel large-scale dataset and three-stage training with RL and IoU rewards, claiming 25.7% accuracy gains and 11.7% anomaly reduction.
-
MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation
MuSS is a new movie-sourced dataset and benchmark that enables AI models to generate multi-shot videos with improved narrative coherence and subject identity preservation.
-
Flow of Truth: Proactive Temporal Forensics for Image-to-Video Generation
Flow of Truth is the first proactive temporal forensics framework for image-to-video generation that uses a learnable forensic template following pixel motion and a template-guided flow module to decouple motion from content.
-
OmniCamera: A Unified Framework for Multi-task Video Generation with Arbitrary Camera Control
OmniCamera disentangles video content and camera motion for multi-task generation with arbitrary camera control via the OmniCAM hybrid dataset and Dual-level Curriculum Co-Training.
-
HumANDiff: Articulated Noise Diffusion for Motion-Consistent Human Video Generation
HumANDiff improves motion consistency in human video generation by sampling diffusion noise on an articulated human body template and adding joint appearance-motion prediction plus a geometric consistency loss.
-
AvatarPointillist: AutoRegressive 4D Gaussian Avatarization
AvatarPointillist autoregressively generates adaptive 3D point clouds via Transformer for photorealistic 4D Gaussian avatars from one image, jointly predicting animation bindings and using a conditioned Gaussian decoder.
-
Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards
SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.
-
MultiAnimate: Pose-Guided Image Animation Made Extensible
MultiAnimate adds Identifier Assigner and Identifier Adapter modules to diffusion video models so they can handle multiple characters without identity mix-ups, generalizing from two-character training data to more characters.
-
CoMoVi: Co-Generation of 3D Human Motions and Realistic Videos
CoMoVi co-generates 3D human motions and 2D videos synchronously in a single diffusion denoising loop using 3D-to-2D projection and dual-branch diffusion with 3D-2D cross attentions.
-
AVI-Edit: Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner
AVI-Edit enables precise audio-synchronized instance-level video editing via a granularity-aware mask refiner, a self-feedback audio agent, and a new large-scale annotated dataset.
-
VABench: A Comprehensive Benchmark for Audio-Video Generation
VABench is a new multi-dimensional benchmark for evaluating synchronous audio-video generation across text-to-AV, image-to-AV, and stereo tasks.
-
ASTRA: Let Arbitrary Subjects Transform in Video Editing
ASTRA is a plug-and-play training-free method for precise multi-subject video editing that uses prompt-guided multimodal alignment and prior-based mask retargeting to avoid attention dilution and boundary issues.
-
Durian: Dual Reference Image-Guided Portrait Animation with Attribute Transfer
Durian introduces a dual-reference diffusion model trained via self-reconstruction on video frames to enable cross-identity attribute transfer in portrait animations, supporting multi-attribute composition and interpolation.
-
Doloris: Dual Conditional Diffusion Implicit Bridges with Sparsity Masking Strategy for Unpaired Single-Cell Perturbation Estimation
Doloris introduces dual conditional diffusion implicit bridges plus a sparsity masking strategy to model unpaired single-cell perturbation responses and reports state-of-the-art results on public datasets.
-
GenHSI: Controllable Generation of Human-Scene Interaction Videos
GenHSI is a training-free three-stage pipeline that turns a scene image, character image, and complex HSI prompt into long videos with plausible chained interactions by generating atomic actions, 3D keyframes via 2D inpainting plus optimization, and then feeding them to pre-trained video diffusion.
-
History-Guided Video Diffusion
DFoT enables flexible history conditioning in video diffusion, with history guidance methods that boost temporal consistency and support long rollouts.
-
HandsOnWorld: Unconstrained Egocentric Video Generation with Camera-Disentangled Hand Control
HandsOnWorld creates a hand-controlled egocentric video generator from unconstrained monocular video via a new EgoVid-Pro dataset from monocular reconstruction and a Plücker Hand Map that disentangles camera and hand motion.
-
NeoMap: Training-free Novel-View Synthesis from Single Images and Videos
NeoMap introduces a training-free framework using convergent manifold alternating projection iterations to extract high-fidelity novel views from pre-trained video models, outperforming prior methods on standard benchmarks.
-
ReAge3D: Re-Aging 3D Faces with View Consistency
ReAge3D trains a diffusion re-aging model on synthetic pairs then uses masked propagation from a frontal pivot view to produce consistent multi-view images that supervise 3D face optimization.
-
PermaVid: Consistent Video Generation Across Edits via Disentangled Context Memory
PermaVid disentangles spatial context into semantic appearance and geometric structure via multi-modal memory banks and edit-aware updates to maintain long-term consistency in video generation after edits.
-
OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data
OmniDirector introduces a grid-based camera representation and hierarchical prompt agent for multi-shot camera cloning in video diffusion models trained on million-scale unpaired data.
-
Latent Spatial Memory for Video World Models
Mirage stores and queries 3D scene information in diffusion latent space via depth-guided lifting and warping, yielding 10.57× faster generation and 55× smaller memory than explicit RGB point-cloud baselines while reaching SOTA on WorldScore.
-
LiteVSR: Lightweight Adaptation of Frozen Diffusion Transformers for Video Super-Resolution
LiteVSR performs video super-resolution on a completely frozen Diffusion Transformer via a lightweight State-Aware Adapter that uses dual-stream extraction and time-dependent cross-attention, reaching competitive quality with 11.25% trainable parameters after 12 GPU-hours.
-
OmniGen-AR: AutoRegressive Any-to-Image Generation
OmniGen-AR is a unified autoregressive framework for any-to-image generation that tokenizes text and visual conditions together and uses disentangled causal attention to support tasks like text-to-image, depth-to-image, image editing, and text-to-video while reporting 0.63 on GenEval and 80.02 on VB
-
C3VD-DEFCOL: A Deformable Colonoscopy Dataset with Time-Resolved 3D Ground Truth and Realistic Appearance
C3VD-DEFCOL supplies 110 videos from 11 colon meshes with paired realistic RGB appearance and dense time-resolved 3D ground truth under three levels of parameterized peristaltic deformation for benchmarking deformable reconstruction.
-
Beyond Skeletons: Learning Animation Directly from Driving Videos with Same2X Training Strategy
DirectAnimator bypasses pose extraction using a Driving Cue Triplet and Same2X training strategy to achieve state-of-the-art human animation quality and robustness from raw videos.
-
Adaptive Tokenisation Via Temporal Redundancy Masking And Latent Inpainting
A parameter-free approach drops redundant video tokens via temporal L1 differences in frozen latent space and reconstructs them with LIT, yielding 31x speedup over ElasticTok-CV on TokenBench and DAVIS.
-
Pave-GRPO: Beyond Instantaneous Guidance through Principled Average Velocity Decomposition
Pave-GRPO reformulates GRPO via principled average velocity decomposition to enable denser temporal supervision in flow-based generative model alignment without increasing rollout cost.
-
CameraNoise: Enabling Faithful Camera Control in Video Diffusion through Geometry-Flow-Guided Noise Warping
CameraNoise embeds camera motion into the noise space of video diffusion via Geometry-guided Reprojection Flow and noise warping to achieve faithful trajectory control while preserving the diffusion prior.
-
DRFusion: Drift-Resilient Temporally Consistent Infrared-Visible Video Fusion
DRFusion uses Stabilized History Guidance, Soft Temporal Anchoring, and Decoupled Structure-Motion Adaptation to achieve drift-resilient temporal consistency in infrared-visible video fusion.
-
Lance: Unified Multimodal Modeling by Multi-Task Synergy
Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keeping strong understanding performance.
-
VISTA: Triplet-Supervised Video Style Transfer with Diffusion Transformers
VISTA introduces a new synthetic triplet dataset and diffusion-transformer framework with style adapter that jointly models style, content, and motion to achieve state-of-the-art video style transfer.
-
ReactiveGWM: Steering NPC in Reactive Game World Models
ReactiveGWM introduces a decoupled diffusion architecture for player-NPC interactions that learns game-agnostic response logic for zero-shot strategy transfer across games.