AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
48 Pith papers cite this work.
abstract
With the advance of text-to-image (T2I) diffusion models (e.g., Stable Diffusion) and corresponding personalization techniques such as DreamBooth and LoRA, everyone can manifest their imagination into high-quality images at an affordable cost. However, adding motion dynamics to existing high-quality personalized T2Is and enabling them to generate animations remains an open challenge. In this paper, we present AnimateDiff, a practical framework for animating personalized T2I models without requiring model-specific tuning. At the core of our framework is a plug-and-play motion module that can be trained once and seamlessly integrated into any personalized T2Is originating from the same base T2I. Through our proposed training strategy, the motion module effectively learns transferable motion priors from real-world videos. Once trained, the motion module can be inserted into a personalized T2I model to form a personalized animation generator. We further propose MotionLoRA, a lightweight fine-tuning technique for AnimateDiff that enables a pre-trained motion module to adapt to new motion patterns, such as different shot types, at a low training and data collection cost. We evaluate AnimateDiff and MotionLoRA on several public representative personalized T2I models collected from the community. The results demonstrate that our approaches help these models generate temporally smooth animation clips while preserving the visual quality and motion diversity. Codes and pre-trained weights are available at https://github.com/guoyww/AnimateDiff.
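The plug-and-play workflow described above can be reproduced with the community diffusers integration rather than the original training repository. The sketch below is a minimal example, assuming the AnimateDiffPipeline and MotionAdapter classes plus the guoyww/animatediff-motion-adapter-v1-5-2 checkpoint from recent diffusers releases; the personalized checkpoint name is only a stand-in for any SD1.5-derived model.

```python
# Minimal sketch of the plug-and-play workflow via the community diffusers
# integration. Class names, checkpoint IDs, and scheduler settings are
# assumptions based on recent diffusers releases, not the paper's own code.
import torch
from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter
from diffusers.utils import export_to_gif

# 1) Load the motion module, trained once on real-world videos.
adapter = MotionAdapter.from_pretrained(
    "guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16
)

# 2) Plug it into a personalized T2I derived from the same base model
#    (this community SD1.5 checkpoint is purely an illustrative stand-in).
pipe = AnimateDiffPipeline.from_pretrained(
    "SG161222/Realistic_Vision_V5.1_noVAE",
    motion_adapter=adapter,
    torch_dtype=torch.float16,
)
pipe.scheduler = DDIMScheduler.from_config(
    pipe.scheduler.config, beta_schedule="linear", clip_sample=False
)
pipe.enable_model_cpu_offload()

# 3) Optionally adapt the motion prior to a new shot type with a MotionLoRA
#    (checkpoint ID assumed from the community Hugging Face releases):
# pipe.load_lora_weights("guoyww/animatediff-motion-lora-zoom-out", adapter_name="zoom-out")

# 4) Sample a short animation from the resulting personalized animation generator.
frames = pipe(
    prompt="a corgi running on the beach at sunset, highly detailed",
    negative_prompt="low quality, deformed",
    num_frames=16,
    num_inference_steps=25,
    guidance_scale=7.5,
).frames[0]
export_to_gif(frames, "animation.gif")
```

Because the motion module only inserts temporal layers, swapping the base checkpoint for a different personalized model from the same Stable Diffusion lineage changes the visual domain without retraining the motion prior, and MotionLoRA weights layer on top for new shot types such as zoom or pan.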
citing papers explorer
-
OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation
OmniNFT introduces modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting in an online diffusion RL framework to improve audio-video quality, alignment, and synchronization.
-
CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating
CaC is a hierarchical spatiotemporal concentrating reward model for video anomalies that reports 25.7% accuracy gains on fine-grained benchmarks and 11.7% anomaly reduction in generated videos via a new dataset and GRPO training with temporal/spatial IoU rewards.
-
MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation
MuSS is a new movie-sourced dataset and benchmark that enables AI models to generate multi-shot videos with improved narrative coherence and subject identity preservation.
-
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.
-
Flow of Truth: Proactive Temporal Forensics for Image-to-Video Generation
Flow of Truth introduces a learnable forensic template and template-guided flow module that follows pixel motion to enable temporal tracing in image-to-video generation.
-
OmniCamera: A Unified Framework for Multi-task Video Generation with Arbitrary Camera Control
OmniCamera disentangles video content and camera motion for multi-task generation with arbitrary camera control via the OmniCAM hybrid dataset and Dual-level Curriculum Co-Training.
-
HumANDiff: Articulated Noise Diffusion for Motion-Consistent Human Video Generation
HumANDiff improves motion consistency in human video generation by sampling diffusion noise on an articulated human body template and adding joint appearance-motion prediction plus a geometric consistency loss.
-
AvatarPointillist: AutoRegressive 4D Gaussian Avatarization
AvatarPointillist autoregressively generates adaptive 3D point clouds with a Transformer to build photorealistic 4D Gaussian avatars from a single image, jointly predicting animation bindings and using a conditioned Gaussian decoder.
-
OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation
OmniHumanoid factorizes transferable motion learning from embodiment-specific adaptation to enable scalable cross-embodiment video generation without paired data for new humanoids.
-
Stream-T1: Test-Time Scaling for Streaming Video Generation
Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve temporal consistency and visual quality.
-
Stylistic Attribute Control in Latent Diffusion Models
A technique for parametric stylistic control in latent diffusion models learns disentangled directions from synthetic datasets and applies them via guidance composition while preserving semantics.
-
AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
AsymTalker uses temporal reference encoding and asymmetric knowledge distillation to produce identity-consistent talking head videos up to 600 seconds long at 66 FPS.
-
Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation
Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.
-
Exploring the Role of Synthetic Data Augmentation in Controllable Human-Centric Video Generation
Synthetic data complements real data in diffusion-based controllable human video generation, with effective sample selection improving motion realism, temporal consistency, and identity preservation.
-
CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation
CoInteract adds a human-aware mixture-of-experts and spatially-structured co-generation to a diffusion transformer to synthesize videos with stable structures and physically plausible human-object contacts.
-
Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation
A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.
-
MetaEarth3D: Unlocking World-scale 3D Generation with Spatially Scalable Generative Modeling
MetaEarth3D is the first generative foundation model for spatially consistent, unbounded 3D scene generation at planetary scale using optical Earth observation data.
-
VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects
VEFX-Bench releases a large human-labeled video editing dataset, a multi-dimensional reward model, and a standardized benchmark that better matches human judgments than generic evaluators.
-
Beyond Text Prompts: Precise Concept Erasure through Text-Image Collaboration
TICoE achieves more precise and faithful concept erasure in text-to-image models by jointly exploiting text and image data through a convex manifold and hierarchical learning, outperforming prior methods.
-
Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective
The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temporal-aware modeling.
-
From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation
Interpolating exo and ego videos into a single continuous sequence lets diffusion sequence models generate more coherent first-person videos than direct conditioning, even without pose interpolation.
-
DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer
RTR-DiT distills a bidirectional DiT teacher into an autoregressive few-step model using Self Forcing and Distribution Matching Distillation, plus a reference-preserving KV cache, to enable stable real-time text- and reference-guided video stylization.
-
VibeFlow: Versatile Video Chroma-Lux Editing through Self-Supervised Learning
VibeFlow performs versatile video chroma-lux editing in zero-shot fashion by self-supervised disentanglement of structure and color-illumination cues inside pre-trained video models, plus residual velocity fields and consistency regularization.
-
Generative Refinement Networks for Visual Synthesis
GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.
-
ARGen: Affect-Reinforced Generative Augmentation towards Vision-based Dynamic Emotion Perception
ARGen generates high-fidelity dynamic facial expression videos using affective semantic injection and adaptive reinforcement diffusion to improve emotion recognition models facing data scarcity and long-tail distributions.
-
Latent-Compressed Variational Autoencoder for Video Diffusion Models
A frequency-based latent compression method for video VAEs yields higher reconstruction quality than channel-reduction baselines at fixed compression ratios.
-
VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation
VGA-Bench creates a three-tier taxonomy, a 1,016-prompt dataset of 60k+ videos, and three multi-task neural models (VAQA-Net, VTag-Net, VGQA-Net) that align with human judgments for video aesthetics and generation quality.
-
LMGenDrive: Bridging Multimodal Understanding and Generative World Modeling for End-to-End Driving
LMGenDrive unifies LLM-based multimodal understanding with generative world models to output both future driving videos and control signals for end-to-end closed-loop autonomous driving.
-
ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks
ImVideoEdit learns video editing from 13K image pairs by decoupling spatial modifications from frozen temporal dynamics in pretrained models, matching larger video-trained systems in fidelity and consistency.
-
INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling
INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via the STAR (spatiotemporal autoregressive) architecture with implicit caching, explicit geometric constraints, and distribution-matching distillation.
-
Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision
Vanast produces coherent garment-transferred human animation videos from a single human image, garment images, and a pose guidance video, using synthetic triplet supervision and a Dual Module video diffusion transformer architecture.
-
VERTIGO: Visual Preference Optimization for Cinematic Camera Trajectory Generation
VERTIGO post-trains camera trajectory generators with visual preference signals from Unity-rendered previews scored by a cinematically fine-tuned VLM, cutting character off-screen rates from 38% to near zero while improving framing and prompt adherence.
-
Emu3: Next-Token Prediction is All You Need
Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.
-
ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis
ViewCrafter tames video diffusion models with point-based 3D guidance and iterative trajectory planning to produce high-fidelity novel views from single or sparse images.
-
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.
-
CameraCtrl: Enabling Camera Control for Text-to-Video Generation
CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.
-
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results while enabling motion LoRA and multi-view 3D applications.
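As an illustration of the image-to-video use case, the released checkpoint can be driven through the diffusers integration. This is a minimal sketch, assuming the StableVideoDiffusionPipeline class and the stabilityai/stable-video-diffusion-img2vid-xt checkpoint ID from recent diffusers releases; the conditioning image path is a placeholder.

```python
# Minimal image-to-video sketch via the diffusers integration (class and
# checkpoint names assumed from recent releases; not the original training code).
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import export_to_video, load_image

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.enable_model_cpu_offload()

# Conditioning image (placeholder path); the model expects roughly 1024x576 input.
image = load_image("conditioning_frame.png").resize((1024, 576))

frames = pipe(
    image,
    decode_chunk_size=8,      # trade VRAM for decoding speed
    motion_bucket_id=127,     # higher values request more motion
    noise_aug_strength=0.02,  # small noise added to the conditioning image
).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```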
-
R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow
R-DMesh uses a VAE with a learned rectification jump offset and Triflow Attention inside a rectified-flow diffusion transformer to produce video-aligned 4D meshes despite initial pose misalignment.
-
Hitem3D 2.0: Multi-View Guided Native 3D Texture Generation
Hitem3D 2.0 combines multi-view image synthesis with native 3D texture projection to improve completeness, cross-view consistency, and geometry alignment over prior methods.
-
Not all tokens contribute equally to diffusion learning
DARE mitigates neglect of important tokens in conditional diffusion models via distribution-rectified guidance and spatial attention alignment.
-
Open-Sora: Democratizing Efficient Video Production for All
Open-Sora releases an open-source video generation model based on a Spatial-Temporal Diffusion Transformer that decouples spatial and temporal attention, supporting text-to-video, image-to-video, and text-to-image tasks with claimed high fidelity.
-
Movie Gen: A Cast of Media Foundation Models
A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.
-
AnimateAnyMesh++: A Flexible 4D Foundation Model for High-Fidelity Text-Driven Mesh Animation
AnimateAnyMesh++ animates arbitrary 3D meshes from text using an expanded 300K-identity DyMesh-XL dataset, a power-law topology-aware DyMeshVAE-Flex, and a variable-length rectified-flow generator to produce semantically accurate, temporally coherent animations in seconds.
-
World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
World-R1 uses RL with 3D model feedback and a new text dataset to improve geometric consistency in text-to-video generation while keeping the base model unchanged.
-
Empowering Video Translation using Multimodal Large Language Models
The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.
-
LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation
This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challenges like instance permanence and consistent interaction.
-
Evolution of Video Generative Foundations
This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.
-
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.