CoMoGen generates controllable interactive video from mask sequences and images by encoding masks into MMDiT via MaskAdapter and LoRA on motion layers, claiming SOTA motion fidelity.
hub Canonical reference
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
Canonical reference. 75% of citing Pith papers cite this work as background.
abstract
With the advance of text-to-image (T2I) diffusion models (e.g., Stable Diffusion) and corresponding personalization techniques such as DreamBooth and LoRA, everyone can manifest their imagination into high-quality images at an affordable cost. However, adding motion dynamics to existing high-quality personalized T2Is and enabling them to generate animations remains an open challenge. In this paper, we present AnimateDiff, a practical framework for animating personalized T2I models without requiring model-specific tuning. At the core of our framework is a plug-and-play motion module that can be trained once and seamlessly integrated into any personalized T2Is originating from the same base T2I. Through our proposed training strategy, the motion module effectively learns transferable motion priors from real-world videos. Once trained, the motion module can be inserted into a personalized T2I model to form a personalized animation generator. We further propose MotionLoRA, a lightweight fine-tuning technique for AnimateDiff that enables a pre-trained motion module to adapt to new motion patterns, such as different shot types, at a low training and data collection cost. We evaluate AnimateDiff and MotionLoRA on several public representative personalized T2I models collected from the community. The results demonstrate that our approaches help these models generate temporally smooth animation clips while preserving the visual quality and motion diversity. Codes and pre-trained weights are available at https://github.com/guoyww/AnimateDiff.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract With the advance of text-to-image (T2I) diffusion models (e.g., Stable Diffusion) and corresponding personalization techniques such as DreamBooth and LoRA, everyone can manifest their imagination into high-quality images at an affordable cost. However, adding motion dynamics to existing high-quality personalized T2Is and enabling them to generate animations remains an open challenge. In this paper, we present AnimateDiff, a practical framework for animating personalized T2I models without requiring model-specific tuning. At the core of our framework is a plug-and-play motion module that can be
co-cited works
representative citing papers
Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture with a new paired dataset.
Soap2Soap uses a multi-agent system with dual-bridge consistency via JSON screenplays and visual anchors plus batch keyframe generation to achieve better long-term consistency in cinematic video remaking than commercial APIs.
StreamingEffect enables real-time 720p human-centric video effect generation on one GPU via teacher-student distillation, keyframe control, and a new 130K video dataset.
DEVIS-GRPO applies online policy gradients with an accumulative small-to-large view sampling strategy and multi-level rewards to improve trajectory-controlled extreme view video generation, reporting gains on Kubric-4D, iPhone, and DL3DV datasets.
Echo-Forcing decouples stable anchors, compressed history, and recent dynamics in video diffusion KV caches using hierarchical memory, scene recall frames, and difference-aware decay to support interactive long video generation under bounded cache.
R-DMesh generates high-fidelity 4D meshes aligned to video by disentangling base mesh, motion, and a learned rectification jump offset inside a VAE, then using Triflow Attention and rectified-flow diffusion.
OmniNFT introduces modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting in an online diffusion RL framework to improve audio-video quality, alignment, and synchronization.
MuSS is a new movie-sourced dataset and benchmark that enables AI models to generate multi-shot videos with improved narrative coherence and subject identity preservation.
UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.
Flow of Truth is the first proactive temporal forensics framework for image-to-video generation that uses a learnable forensic template following pixel motion and a template-guided flow module to decouple motion from content.
OmniCamera disentangles video content and camera motion for multi-task generation with arbitrary camera control via the OmniCAM hybrid dataset and Dual-level Curriculum Co-Training.
HumANDiff improves motion consistency in human video generation by sampling diffusion noise on an articulated human body template and adding joint appearance-motion prediction plus a geometric consistency loss.
AvatarPointillist autoregressively generates adaptive 3D point clouds via Transformer for photorealistic 4D Gaussian avatars from one image, jointly predicting animation bindings and using a conditioned Gaussian decoder.
SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.
MultiAnimate adds Identifier Assigner and Identifier Adapter modules to diffusion video models so they can handle multiple characters without identity mix-ups, generalizing from two-character training data to more characters.
CoMoVi co-generates 3D human motions and 2D videos synchronously in a single diffusion denoising loop using 3D-to-2D projection and dual-branch diffusion with 3D-2D cross attentions.
AVI-Edit enables precise audio-synchronized instance-level video editing via a granularity-aware mask refiner, a self-feedback audio agent, and a new large-scale annotated dataset.
VABench is a new multi-dimensional benchmark for evaluating synchronous audio-video generation across text-to-AV, image-to-AV, and stereo tasks.
ASTRA is a plug-and-play training-free method for precise multi-subject video editing that uses prompt-guided multimodal alignment and prior-based mask retargeting to avoid attention dilution and boundary issues.
Durian introduces a dual-reference diffusion model trained via self-reconstruction on video frames to enable cross-identity attribute transfer in portrait animations, supporting multi-attribute composition and interpolation.
Doloris introduces dual conditional diffusion implicit bridges plus a sparsity masking strategy to model unpaired single-cell perturbation responses and reports state-of-the-art results on public datasets.
GenHSI is a training-free three-stage pipeline that turns a scene image, character image, and complex HSI prompt into long videos with plausible chained interactions by generating atomic actions, 3D keyframes via 2D inpainting plus optimization, and then feeding them to pre-trained video diffusion.
DFoT enables flexible history conditioning in video diffusion, with history guidance methods that boost temporal consistency and support long rollouts.
citing papers explorer
-
MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation
MuSS is a new movie-sourced dataset and benchmark that enables AI models to generate multi-shot videos with improved narrative coherence and subject identity preservation.
-
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.
-
OmniCamera: A Unified Framework for Multi-task Video Generation with Arbitrary Camera Control
OmniCamera disentangles video content and camera motion for multi-task generation with arbitrary camera control via the OmniCAM hybrid dataset and Dual-level Curriculum Co-Training.
-
HumANDiff: Articulated Noise Diffusion for Motion-Consistent Human Video Generation
HumANDiff improves motion consistency in human video generation by sampling diffusion noise on an articulated human body template and adding joint appearance-motion prediction plus a geometric consistency loss.
-
CoMoVi: Co-Generation of 3D Human Motions and Realistic Videos
CoMoVi co-generates 3D human motions and 2D videos synchronously in a single diffusion denoising loop using 3D-to-2D projection and dual-branch diffusion with 3D-2D cross attentions.
-
Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction
CineNeuron improves fMRI-to-video reconstruction by combining bottom-up semantic enrichment with top-down Mixture-of-Memories integration and outperforms prior methods on benchmarks.
-
OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation
OmniHumanoid factorizes transferable motion learning from embodiment-specific adaptation to enable scalable cross-embodiment video generation without paired data for new humanoids.
-
Stream-T1: Test-Time Scaling for Streaming Video Generation
Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve temporal consistency and visual quality.
-
Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation
Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.
-
CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation
CoInteract adds a human-aware mixture-of-experts and spatially-structured co-generation to a diffusion transformer to synthesize videos with stable structures and physically plausible human-object contacts.
-
Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation
A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.
-
VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects
VEFX-Bench releases a large human-labeled video editing dataset, a multi-dimensional reward model, and a standardized benchmark that better matches human judgments than generic evaluators.
-
Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective
The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temporal-aware modeling.
-
From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation
Interpolating exo and ego videos into a single continuous sequence lets diffusion sequence models generate more coherent first-person videos than direct conditioning, even without pose interpolation.
-
Latent-Compressed Variational Autoencoder for Video Diffusion Models
A frequency-based latent compression method for video VAEs yields higher reconstruction quality than channel-reduction baselines at fixed compression ratios.
-
ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks
ImVideoEdit learns video editing from 13K image pairs by decoupling spatial modifications from frozen temporal dynamics in pretrained models, matching larger video-trained systems in fidelity and consistency.
-
INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling
INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via STAR architecture with implicit caching, explicit geometric constraints, and distribution-matching distillation.
-
VERTIGO: Visual Preference Optimization for Cinematic Camera Trajectory Generation
VERTIGO post-trains camera trajectory generators with visual preference signals from Unity-rendered previews scored by a cinematically fine-tuned VLM, cutting character off-screen rates from 38% to near zero while improving framing and prompt adherence.
-
ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis
ViewCrafter tames video diffusion models with point-based 3D guidance and iterative trajectory planning to produce high-fidelity novel views from single or sparse images.
-
Open-Sora: Democratizing Efficient Video Production for All
Open-Sora releases an open-source video generation model based on a Spatial-Temporal Diffusion Transformer that decouples spatial and temporal attention, supporting text-to-video, image-to-video, and text-to-image tasks with claimed high fidelity.
-
AnimateAnyMesh++: A Flexible 4D Foundation Model for High-Fidelity Text-Driven Mesh Animation
AnimateAnyMesh++ animates arbitrary 3D meshes from text using an expanded 300K-identity DyMesh-XL dataset, a power-law topology-aware DyMeshVAE-Flex, and a variable-length rectified-flow generator to produce semantically accurate, temporally coherent animations in seconds.
-
LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation
This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challenges like instance permanence and consistent interaction.
-
Evolution of Video Generative Foundations
This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.
-
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.
- CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating
- World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
- Multimodal Large Language Model-Enabled Video Translation: A Role-Oriented Survey