Imagen Video: High Definition Video Generation with Diffusion Models
48 Pith papers cite this work.
abstract
We present Imagen Video, a text-conditional video generation system based on a cascade of video diffusion models. Given a text prompt, Imagen Video generates high definition videos using a base video generation model and a sequence of interleaved spatial and temporal video super-resolution models. We describe how we scale up the system as a high definition text-to-video model including design decisions such as the choice of fully-convolutional temporal and spatial super-resolution models at certain resolutions, and the choice of the v-parameterization of diffusion models. In addition, we confirm and transfer findings from previous work on diffusion-based image generation to the video generation setting. Finally, we apply progressive distillation to our video models with classifier-free guidance for fast, high quality sampling. We find Imagen Video not only capable of generating videos of high fidelity, but also having a high degree of controllability and world knowledge, including the ability to generate diverse videos and text animations in various artistic styles and with 3D object understanding. See https://imagen.research.google/video/ for samples.
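Two of the design decisions named in the abstract, the v-parameterization and classifier-free guidance, can be stated concretely. The sketch below is a minimal illustration under the standard variance-preserving convention (alpha_t^2 + sigma_t^2 = 1), not the authors' implementation; `denoiser`, `text_emb`, and `null_emb` are hypothetical placeholders for one cascade stage and its conditioning.

```python
# Minimal sketch (not the paper's code) of v-parameterization and
# classifier-free guidance for a variance-preserving diffusion model,
# where z_t = alpha_t * x0 + sigma_t * eps and alpha_t**2 + sigma_t**2 = 1.
import numpy as np

def v_to_x0_eps(z_t, v_hat, alpha_t, sigma_t):
    """Recover data and noise estimates from a v-prediction.

    With v = alpha_t * eps - sigma_t * x0:
      x0  = alpha_t * z_t - sigma_t * v
      eps = sigma_t * z_t + alpha_t * v
    """
    x0_hat = alpha_t * z_t - sigma_t * v_hat
    eps_hat = sigma_t * z_t + alpha_t * v_hat
    return x0_hat, eps_hat

def guided_v(denoiser, z_t, t, text_emb, null_emb, w):
    """Classifier-free guidance applied to the v-prediction.

    The same network is queried with and without the text condition, and the
    two outputs are extrapolated with guidance weight w (w = 1 recovers the
    plain conditional prediction).
    """
    v_cond = denoiser(z_t, t, text_emb)
    v_uncond = denoiser(z_t, t, null_emb)
    return v_uncond + w * (v_cond - v_uncond)

if __name__ == "__main__":
    # Toy scalar check: x0 and eps are recovered exactly from v.
    alpha_t, sigma_t = np.cos(0.3), np.sin(0.3)   # alpha^2 + sigma^2 = 1
    x0, eps = 1.5, -0.2
    z_t = alpha_t * x0 + sigma_t * eps
    v = alpha_t * eps - sigma_t * x0
    print(v_to_x0_eps(z_t, v, alpha_t, sigma_t))  # approximately (1.5, -0.2)
```

Guidance weights w > 1 push samples toward the text condition at the cost of diversity; the abstract notes that progressive distillation is then applied to the guided model for fast, high-quality sampling.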
citing papers explorer
-
Consistency Models
Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.
-
Dynamic Full-body Motion Agent with Object Interaction via Blending Pre-trained Modular Controllers
A two-stage framework augments HOI data with dynamic priors and blends pre-trained dynamic motion and static interaction agents via a composer network to enable long-term dynamic human-object interactions with higher success rates and reduced training time.
-
DCR: Counterfactual Attractor Guidance for Rare Compositional Generation
DCR uses a counterfactual attractor and projection-based repulsion to suppress default completion bias in diffusion models, improving fidelity for rare compositional prompts while preserving quality.
-
Active Sampling for Ultra-Low-Bit-Rate Video Compression via Conditional Controlled Diffusion
ActDiff-VC achieves up to 64.6% bitrate reduction at matched NIQE and improves perceptual metrics like KID and FID by using content-adaptive keyframe selection and budget-aware sparse trajectory selection to condition a diffusion decoder for ultra-low-bitrate video reconstruction.
-
Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation
Cutscene Agent uses a multi-agent LLM system and a new toolkit for game engine control to automate end-to-end 3D cutscene generation, evaluated on the introduced CutsceneBench.
-
Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation
Sparse Forcing adds a native trainable sparsity mechanism and PBSA kernel to autoregressive diffusion video models, yielding higher VBench scores and 1.1-1.27x speedups on 5-second to 1-minute generations.
-
Immune2V: Image Immunization Against Dual-Stream Image-to-Video Generation
Immune2V immunizes images against dual-stream I2V generation by enforcing temporally balanced latent divergence and aligning generative features to a precomputed collapse trajectory, yielding stronger persistent degradation than image-level baselines.
-
Training-Free Refinement of Flow Matching with Divergence-based Sampling
Flow Divergence Sampler refines flow matching by computing velocity field divergence to correct ambiguous intermediate states during inference, improving fidelity in text-to-image and inverse problem tasks.
-
OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text
OmniSonic introduces a TriAttn-DiT architecture with MoE gating to jointly generate on-screen, off-screen, and speech audio from video and text, outperforming prior models on a new UniHAGen-Bench.
-
Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation
SCOPE accelerates autoregressive video diffusion up to 4.73x by using a tri-modal cache-predict-recompute scheduler with Taylor extrapolation and selective active-frame computation while preserving output quality.
-
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.
-
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
A single motion module trained on videos adds temporally coherent animation to any personalized text-to-image model derived from the same base without additional tuning.
-
OmniLiDAR: A Unified Diffusion Framework for Multi-Domain 3D LiDAR Generation
A unified text-conditioned diffusion model generates high-fidelity LiDAR scans across eight domains spanning weather, sensor, and platform shifts using cross-domain training and feature modeling.
-
FIS-DiT: Breaking the Few-Step Video Inference Barrier via Training-Free Frame Interleaved Sparsity
FIS-DiT achieves 2.11-2.41x speedup on video DiT models in few-step regimes with negligible quality loss by exploiting frame-wise sparsity and consistency through a training-free interleaved execution strategy.
-
Stage-adaptive audio diffusion modeling
A semantic progress signal from SSL discrepancy slope enables three stage-aware mechanisms that improve training efficiency and performance in audio diffusion models over static baselines.
-
Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation
Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.
-
Quotient-Space Diffusion Models
Quotient-space diffusion models handle symmetries by diffusing on the space of equivalent configurations under group actions like SE(3), reducing learning complexity and guaranteeing correct sampling for molecular generation.
-
DynamicRad: Content-Adaptive Sparse Attention for Long Video Diffusion
DynamicRad achieves 1.7x-2.5x inference speedups in long video diffusion with over 80% sparsity by grounding adaptive selection in a radial locality prior, using dual-mode static/dynamic strategies and offline BO with a semantic motion router.
-
CityRAG: Stepping Into a City via Spatially-Grounded Video Generation
CityRAG generates minutes-long 3D-consistent videos of real-world cities by grounding outputs in geo-registered data and using temporally unaligned training to disentangle fixed scenes from transient elements like weather.
-
From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation
Interpolating exo and ego videos into a single continuous sequence lets diffusion sequence models generate more coherent first-person videos than direct conditioning, even without pose interpolation.
-
FreqFormer: Hierarchical Frequency-Domain Attention with Adaptive Spectral Routing for Long-Sequence Video Diffusion Transformers
FreqFormer applies heterogeneous attention (dense global on low frequencies, block-sparse on mid, local on high) plus adaptive spectral routing to reduce attention cost in long-sequence video diffusion transformers.
-
ARGen: Affect-Reinforced Generative Augmentation towards Vision-based Dynamic Emotion Perception
ARGen generates high-fidelity dynamic facial expression videos using affective semantic injection and adaptive reinforcement diffusion to improve emotion recognition models facing data scarcity and long-tail distributions.
-
Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction
Re2Pix decomposes video prediction into semantic feature forecasting followed by representation-conditioned diffusion synthesis, with nested dropout and mixed supervision to handle prediction errors.
-
VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation
VGA-Bench creates a three-tier taxonomy, 1,016-prompt dataset of 60k+ videos, and three multi-task neural models (VAQA-Net, VTag-Net, VGQA-Net) that align with human judgments for video aesthetics and generation quality.
-
When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models
NUMINA improves counting accuracy in text-to-video diffusion models by up to 7.4% via a training-free identify-then-guide framework on the new CountBench dataset.
-
INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling
INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via STAR architecture with implicit caching, explicit geometric constraints, and distribution-matching distillation.
-
Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?
Veo-3 video predictions enable approximate task-level robot trajectories in zero-shot settings but require hierarchical integration with low-level VLA policies for reliable manipulation performance.
-
Lifting Unlabeled Internet-level Data for 3D Scene Understanding
Unlabeled web videos processed by designed data engines generate effective training data that yields strong zero-shot and finetuned performance on 3D detection, segmentation, VQA, and navigation.
-
Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
Self Forcing trains autoregressive video diffusion models by performing autoregressive rollout with KV caching during training to close the exposure bias gap, using a holistic video-level loss and few-step diffusion for efficiency.
-
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense, using VLMs, LLMs, and anomaly detection methods.
-
Unified Video Action Model
UVA learns a joint video-action latent representation with decoupled diffusion decoding heads, enabling a single model to perform accurate fast policy learning, forward/inverse dynamics, and video generation without performance loss versus task-specific methods.
-
LTX-Video: Realtime Video Latent Diffusion
LTX-Video integrates a Video-VAE and a transformer for 1:192 latent compression and real-time video diffusion by moving patchification into the VAE and letting the decoder finish denoising in pixel space.
-
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.
-
CameraCtrl: Enabling Camera Control for Text-to-Video Generation
CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.
-
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results while enabling motion LoRA and multi-view 3D applications.
-
VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
Open-source text-to-video and image-to-video diffusion models generate high-quality 1024x576 videos, with the I2V variant claimed as the first to strictly preserve reference image content.
-
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
SDXL improves upon prior Stable Diffusion versions through a larger UNet backbone, dual text encoders, novel conditioning, and a refinement model, producing higher-fidelity images competitive with black-box state-of-the-art generators.
-
Training Diffusion Models with Reinforcement Learning
DDPO uses policy gradients on the denoising process to optimize diffusion models for arbitrary rewards like human feedback or compressibility.
-
RealDiffusion: Physics-informed Attention for Multi-character Storybook Generation
RealDiffusion uses heat diffusion as a dissipative prior and a region-aware stochastic process inside a training-free physics-informed attention mechanism to improve multi-character coherence while preserving narrative dynamism in sequential image generation.
-
Diffusion-APO: Trajectory-Aware Direct Preference Alignment for Video Diffusion Transformers
Diffusion-APO synchronizes training noise with inference trajectories in video diffusion models to improve preference alignment and visual quality.
-
ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation
ST-Gen4D uses a world model that fuses global appearance and local dynamic graphs into a 4D cognition representation to guide consistent 4D Gaussian generation.
-
The Amazing Stability of Flow Matching
Flow matching generative models preserve sample quality, diversity, and latent representations despite pruning 50% of the CelebA-HQ dataset or altering architecture and training configurations.
-
Prompt-to-Gesture: Measuring the Capabilities of Image-to-Video Deictic Gesture Generation
Prompt-driven image-to-video generation produces deictic gestures that match real data visually, add useful variety, and improve downstream recognition models when mixed with human recordings.
-
DiffMagicFace: Identity Consistent Facial Editing of Real Videos
DiffMagicFace uses concurrent fine-tuned text and image diffusion models plus a rendered multi-view dataset to achieve identity-consistent text-conditioned editing of real facial videos.
-
Movie Gen: A Cast of Media Foundation Models
A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.
-
OpenWorldLib: A Unified Codebase and Definition of Advanced World Models
OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.
-
ModelScope Text-to-Video Technical Report
ModelScopeT2V is a 1.7-billion-parameter text-to-video model built on Stable Diffusion that adds temporal modeling and outperforms prior methods on three evaluation metrics.
-
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.