Recognition: 2 Lean theorem links
Imagen Video: High Definition Video Generation with Diffusion Models
Pith reviewed 2026-05-11 03:25 UTC · model grok-4.3
The pith
A cascade of diffusion models produces high-definition videos from text with controllability and world knowledge.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present a text-conditional video generation system based on a cascade of video diffusion models. Given a text prompt, the system generates high-definition videos using a base video generation model and a sequence of interleaved spatial and temporal video super-resolution models. Design decisions include fully-convolutional temporal and spatial super-resolution models at certain resolutions and the v-parameterization of diffusion models. Findings from previous work on diffusion-based image generation are confirmed and transferred to the video setting. Progressive distillation is applied to the video models with classifier-free guidance for fast, high quality sampling. The system is capable of generating videos of high fidelity with a high degree of controllability and world knowledge, including diverse videos and text animations in various artistic styles and with 3D object understanding.
What carries the argument
A cascade of video diffusion models: a base generation stage followed by interleaved, fully-convolutional spatial and temporal super-resolution stages, trained with the v-parameterization and distilled progressively for fast sampling.
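A minimal sketch of this cascade, with hypothetical model interfaces; the stage counts and shapes below are illustrative, not the paper's exact values:

```python
from dataclasses import dataclass

@dataclass
class SRStage:
    kind: str  # "tsr" doubles the frame count; "ssr" doubles height/width

    def sample(self, video_shape, cond):
        # Each stage is itself a diffusion model conditioned on the text
        # embedding and the (upsampled) output of the previous stage.
        t, h, w = video_shape
        return (2 * t, h, w) if self.kind == "tsr" else (t, 2 * h, 2 * w)

def generate(cond, stages, base_shape=(16, 24, 40)):
    video = base_shape  # output shape of the base text-to-video model
    for stage in stages:
        video = stage.sample(video, cond)
    return video

# Interleaved temporal and spatial super-resolution, as in the paper.
chain = [SRStage("tsr"), SRStage("ssr"), SRStage("tsr"), SRStage("ssr")]
print(generate("a prompt embedding", chain))  # -> (64, 96, 160)
```

The interleaving is the design point: frame count and spatial resolution are grown alternately, so no single model ever has to denoise the full high-definition, high-frame-rate volume.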
If this is right
- The system generates high-fidelity videos conditioned on text prompts.
- Videos exhibit a high degree of controllability and diversity, including text animations in various artistic styles.
- The generated content demonstrates world knowledge such as 3D object understanding.
- Progressive distillation allows for fast sampling without sacrificing quality.
- Image diffusion techniques transfer effectively to the video domain with appropriate adaptations.
Where Pith is reading between the lines
- This cascade method may be adaptable to generate longer videos by extending the temporal super-resolution chain.
- The controllability could enable new tools for creators in media production.
- Further research might explore combining this with other modalities like audio for synchronized content.
- Scaling the models could improve resolution or reduce artifacts in complex scenes.
Load-bearing premise
The chosen cascade of base model and super-resolution stages with v-parameterization and distillation produces temporally coherent high-definition output without major artifacts.
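For context, the v-parameterization referenced here (introduced in Salimans and Ho's progressive-distillation work) has the network predict a velocity-like target rather than the noise. This is the standard formulation, stated for reference rather than quoted from this paper:

```latex
% v-parameterization: with forward process z_t = alpha_t x + sigma_t eps,
% the network predicts v rather than eps, which stays well-conditioned at
% high noise levels and is what makes progressive distillation stable.
\begin{aligned}
z_t &= \alpha_t\, x + \sigma_t\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),\\
v_t &\equiv \alpha_t\, \epsilon - \sigma_t\, x,\\
\hat{x} &= \alpha_t\, z_t - \sigma_t\, \hat{v}_\theta(z_t, t).
\end{aligned}
```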
What would settle it
A collection of generated videos that exhibit temporal flickering, motion artifacts, or poor alignment with the text prompt at high resolutions would show the central claim does not hold.
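One way to operationalize "temporal flickering" is sketched below; this is an illustrative proxy, not a metric from the paper:

```python
import numpy as np

def flicker_profile(video):
    """Crude temporal-coherence proxy (illustrative, not from the paper):
    per-transition mean absolute difference between consecutive frames.
    Large spikes on visually static content would be evidence of the
    flickering artifacts that the load-bearing premise rules out."""
    frames = np.asarray(video, dtype=np.float32)  # (T, H, W, C), values in [0, 1]
    return np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2, 3))

# Example: a perfectly static 8-frame clip has a flat, zero profile.
clip = np.full((8, 64, 64, 3), 0.5)
print(flicker_profile(clip))  # -> array of zeros, length 7
```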
read the original abstract
We present Imagen Video, a text-conditional video generation system based on a cascade of video diffusion models. Given a text prompt, Imagen Video generates high definition videos using a base video generation model and a sequence of interleaved spatial and temporal video super-resolution models. We describe how we scale up the system as a high definition text-to-video model including design decisions such as the choice of fully-convolutional temporal and spatial super-resolution models at certain resolutions, and the choice of the v-parameterization of diffusion models. In addition, we confirm and transfer findings from previous work on diffusion-based image generation to the video generation setting. Finally, we apply progressive distillation to our video models with classifier-free guidance for fast, high quality sampling. We find Imagen Video not only capable of generating videos of high fidelity, but also having a high degree of controllability and world knowledge, including the ability to generate diverse videos and text animations in various artistic styles and with 3D object understanding. See https://imagen.research.google/video/ for samples.
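The classifier-free guidance mentioned in the abstract is the standard formulation (stated here for reference, with w the guidance weight):

```latex
% Classifier-free guidance: extrapolate the conditional prediction away
% from the unconditional one; w = 0 recovers the conditional model.
\tilde{\epsilon}_\theta(z_t, c) = (1 + w)\,\epsilon_\theta(z_t, c) - w\,\epsilon_\theta(z_t)
```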
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Imagen Video, a text-conditional high-definition video generation system based on a cascade of diffusion models. It consists of a base video generation model followed by a sequence of interleaved spatial and temporal video super-resolution models. The authors describe scaling decisions including the use of fully-convolutional SR models at selected resolutions, adoption of v-parameterization, transfer of findings from image diffusion models to the video domain, and progressive distillation combined with classifier-free guidance to enable fast sampling. They claim the resulting system generates videos of high fidelity with controllability, world knowledge, stylistic diversity, text animations, and 3D object understanding, supported by qualitative results.
Significance. If the qualitative demonstrations hold under closer scrutiny, this work is significant for showing that cascaded diffusion models can be scaled to produce temporally coherent high-definition text-to-video output. The empirical transfer of image-generation techniques (v-parameterization, progressive distillation) to video and the practical efficiency gains are useful contributions that could guide subsequent generative video systems.
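As a reading aid for the efficiency claim: progressive distillation halves the sampler's step count per round by training a student to match two teacher DDIM steps with a single step. A minimal sketch under the v-parameterization, with hypothetical model and schedule interfaces:

```python
def ddim_step(v_model, z, t, s, alpha, sigma):
    # Deterministic DDIM step from time index t to s (v-parameterization).
    v = v_model(z, t)
    x_hat = alpha[t] * z - sigma[t] * v      # predicted clean video
    eps_hat = sigma[t] * z + alpha[t] * v    # implied noise
    return alpha[s] * x_hat + sigma[s] * eps_hat

def two_step_teacher_target(teacher, z, t, alpha, sigma):
    # Progressive distillation, sketched: the student's single step from t
    # to t-2 is trained to land where two teacher half-steps land (the
    # full method converts this endpoint into an x/v regression target).
    z_mid = ddim_step(teacher, z, t, t - 1, alpha, sigma)
    return ddim_step(teacher, z_mid, t - 1, t - 2, alpha, sigma)
```

Repeating this halving a few times is what takes the cascade from hundreds of denoising steps to the fast sampling regime the paper reports.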
major comments (1)
- The central claims of high fidelity, temporal coherence, and 3D object understanding rest on qualitative samples alone. No quantitative metrics (e.g., FVD, CLIP similarity, or user-study scores), ablation studies on the interleaving order of spatial/temporal SR stages, or failure-case analysis are referenced in the abstract or high-level description, which is load-bearing for assessing whether the cascade design actually avoids major artifacts at scale.
minor comments (2)
- Abstract: the phrase 'we confirm and transfer findings from previous work' would be clearer if the specific findings (e.g., particular hyper-parameters or architectural motifs) were enumerated.
- The provided link to samples is helpful, but the manuscript would benefit from an explicit limitations paragraph discussing known artifacts or prompt regimes where controllability degrades.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the work and the recommendation for minor revision. We address the major comment below.
read point-by-point responses
-
Referee: The central claims of high fidelity, temporal coherence, and 3D object understanding rest on qualitative samples alone. No quantitative metrics (e.g., FVD, CLIP similarity, or user-study scores), ablation studies on the interleaving order of spatial/temporal SR stages, or failure-case analysis are referenced in the abstract or high-level description, which is load-bearing for assessing whether the cascade design actually avoids major artifacts at scale.
Authors: We appreciate the referee highlighting the evaluation approach. The manuscript's central claims are indeed supported primarily through extensive qualitative results, which we view as the most informative way to demonstrate emergent properties such as controllability, stylistic diversity, and 3D understanding that current automated metrics do not fully capture. The full paper contains detailed qualitative analysis, comparisons to prior work, and a large number of generated examples. To strengthen the high-level presentation, we will revise the abstract to explicitly reference the qualitative evaluation strategy used throughout the manuscript. We will also add a dedicated discussion of limitations and representative failure cases in the revised version. Our choice of interleaving spatial and temporal super-resolution stages was guided by preliminary scaling experiments; we can incorporate a concise rationale for this design choice in the methods section without requiring new large-scale ablations.
revision: partial
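For reference, the FVD metric the referee asks for is a Fréchet distance between feature distributions of real and generated videos, with features typically extracted by a pretrained I3D action-recognition network:

```latex
% Frechet Video Distance between real (r) and generated (g) feature
% distributions, each summarized by a Gaussian N(mu, Sigma).
\mathrm{FVD} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```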
Circularity Check
No significant circularity; the paper is an empirical systems description with findings transferred from prior work.
full rationale
The paper describes an implemented cascade of text-conditional video diffusion models (base + interleaved spatial/temporal SR stages), v-parameterization, and progressive distillation. Design decisions are presented as empirical choices whose success is shown via qualitative samples and controllability demonstrations. No derivation chain, equations, or 'predictions' are claimed that reduce the output to fitted parameters or self-citations by construction. Prior image-generation findings are transferred as independent empirical support rather than used to close a logical loop. The work is self-contained against external benchmarks (generated video quality and controllability).
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
Foundation.DimensionForcing.dimension_forced · tagged unclear
The relation between the paper passage below and the cited Recognition theorem is unclear.
We describe how we scale up the system as a high definition text-to-video model including design decisions such as the choice of fully-convolutional temporal and spatial super-resolution models at certain resolutions
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 57 Pith papers
-
Quotient-Space Diffusion Models
Quotient-space diffusion models generate correct symmetric distributions by removing redundancy on the quotient space, simplifying learning and improving results on small molecules and proteins under SE(3) symmetry.
-
Consistency Models
Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.
-
Training-Free Generative Sampling via Moment-Matched Score Smoothing
MM-SOLD is a training-free particle sampler whose large-particle limit converges to a moment-matched Gibbs distribution obtained by exponentially tilting a score-smoothed target.
-
TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion
TeDiO regularizes temporal diagonals in diffusion transformer attention maps to produce smoother video motion while keeping per-frame quality intact.
-
Covariance-aware sampling for Diffusion Models
A covariance-aware extension of DDIM sampling for pixel-space diffusion models that uses Tweedie's formula and Fourier decomposition to model reverse-process covariance and improves sample quality at low NFE.
-
Dynamic Full-body Motion Agent with Object Interaction via Blending Pre-trained Modular Controllers
A two-stage framework augments HOI data with dynamic priors and blends pre-trained dynamic motion and static interaction agents via a composer network to enable long-term dynamic human-object interactions with higher ...
-
DCR: Counterfactual Attractor Guidance for Rare Compositional Generation
DCR uses a counterfactual attractor and projection-based repulsion to suppress default completion bias in diffusion models, improving fidelity for rare compositional prompts while preserving quality.
-
Active Sampling for Ultra-Low-Bit-Rate Video Compression via Conditional Controlled Diffusion
ActDiff-VC achieves up to 64.6% bitrate reduction at matched NIQE and improves perceptual metrics like KID and FID by using content-adaptive keyframe selection and budget-aware sparse trajectory selection to condition...
-
Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation
Cutscene Agent uses a multi-agent LLM system and a new toolkit for game engine control to automate end-to-end 3D cutscene generation, evaluated on the introduced CutsceneBench.
-
Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation
Sparse Forcing adds a native trainable sparsity mechanism and PBSA kernel to autoregressive diffusion video models, yielding higher VBench scores and 1.1-1.27x speedups on 5s to 1min generations.
-
Immune2V: Image Immunization Against Dual-Stream Image-to-Video Generation
Immune2V immunizes images against dual-stream I2V generation by enforcing temporally balanced latent divergence and aligning generative features to a precomputed collapse trajectory, yielding stronger persistent degra...
-
Training-Free Refinement of Flow Matching with Divergence-based Sampling
Flow Divergence Sampler refines flow matching by computing velocity field divergence to correct ambiguous intermediate states during inference, improving fidelity in text-to-image and inverse problem tasks.
-
OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text
OmniSonic introduces a TriAttn-DiT architecture with MoE gating to jointly generate on-screen, off-screen, and speech audio from video and text, outperforming prior models on a new UniHAGen-Bench.
-
Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation
SCOPE accelerates autoregressive video diffusion up to 4.73x by using a tri-modal cache-predict-recompute scheduler with Taylor extrapolation and selective active-frame computation while preserving output quality.
-
ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation
ChopGrad truncates backpropagation to local frame windows in video diffusion models, reducing memory from linear in frame count to constant while enabling pixel-wise loss fine-tuning.
-
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.
-
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
A single motion module trained on videos adds temporally coherent animation to any personalized text-to-image model derived from the same base without additional tuning.
-
Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity
Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.
-
OmniLiDAR: A Unified Diffusion Framework for Multi-Domain 3D LiDAR Generation
A unified text-conditioned diffusion model generates high-fidelity LiDAR scans across eight domains spanning weather, sensor, and platform shifts using cross-domain training and feature modeling.
-
FIS-DiT: Breaking the Few-Step Video Inference Barrier via Training-Free Frame Interleaved Sparsity
FIS-DiT achieves 2.11-2.41x speedup on video DiT models in few-step regimes with negligible quality loss by exploiting frame-wise sparsity and consistency through a training-free interleaved execution strategy.
-
Stage-adaptive audio diffusion modeling
A semantic progress signal from SSL discrepancy slope enables three stage-aware mechanisms that improve training efficiency and performance in audio diffusion models over static baselines.
-
Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation
Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.
-
Quotient-Space Diffusion Models
Quotient-space diffusion models handle symmetries by diffusing on the space of equivalent configurations under group actions like SE(3), reducing learning complexity and guaranteeing correct sampling for molecular generation.
-
DynamicRad: Content-Adaptive Sparse Attention for Long Video Diffusion
DynamicRad achieves 1.7x-2.5x inference speedups in long video diffusion with over 80% sparsity by grounding adaptive selection in a radial locality prior, using dual-mode static/dynamic strategies and offline BO with...
-
CityRAG: Stepping Into a City via Spatially-Grounded Video Generation
CityRAG generates minutes-long 3D-consistent videos of real-world cities by grounding outputs in geo-registered data and using temporally unaligned training to disentangle fixed scenes from transient elements like weather.
-
From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation
Interpolating exo and ego videos into a single continuous sequence lets diffusion sequence models generate more coherent first-person videos than direct conditioning, even without pose interpolation.
-
FreqFormer: Hierarchical Frequency-Domain Attention with Adaptive Spectral Routing for Long-Sequence Video Diffusion Transformers
FreqFormer applies heterogeneous attention (dense global on low frequencies, block-sparse on mid, local on high) plus adaptive spectral routing to reduce attention cost in long-sequence video diffusion transformers.
-
ARGen: Affect-Reinforced Generative Augmentation towards Vision-based Dynamic Emotion Perception
ARGen generates high-fidelity dynamic facial expression videos using affective semantic injection and adaptive reinforcement diffusion to improve emotion recognition models facing data scarcity and long-tail distributions.
-
Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction
Re2Pix decomposes video prediction into semantic feature forecasting followed by representation-conditioned diffusion synthesis, with nested dropout and mixed supervision to handle prediction errors.
-
VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation
VGA-Bench creates a three-tier taxonomy, 1,016-prompt dataset of 60k+ videos, and three multi-task neural models (VAQA-Net, VTag-Net, VGQA-Net) that align with human judgments for video aesthetics and generation quality.
-
When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models
NUMINA improves counting accuracy in text-to-video diffusion models by up to 7.4% via a training-free identify-then-guide framework on the new CountBench dataset.
-
INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling
INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via STAR architecture with implicit caching, explicit geometric constraints, and distribution-matching...
-
Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?
Veo-3 video predictions enable approximate task-level robot trajectories in zero-shot settings but require hierarchical integration with low-level VLA policies for reliable manipulation performance.
-
Lifting Unlabeled Internet-level Data for 3D Scene Understanding
Unlabeled web videos processed by designed data engines generate effective training data that yields strong zero-shot and finetuned performance on 3D detection, segmentation, VQA, and navigation.
-
Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
Self Forcing trains autoregressive video diffusion models by performing autoregressive rollout with KV caching during training to close the exposure bias gap, using a holistic video-level loss and few-step diffusion f...
-
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs...
-
Unified Video Action Model
UVA learns a joint video-action latent representation with decoupled diffusion decoding heads, enabling a single model to perform accurate fast policy learning, forward/inverse dynamics, and video generation without p...
-
LTX-Video: Realtime Video Latent Diffusion
LTX-Video integrates Video-VAE and transformer for 1:192 latent compression and real-time video diffusion by moving patchifying to the VAE and letting the decoder finish denoising in pixel space.
-
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.
-
CameraCtrl: Enabling Camera Control for Text-to-Video Generation
CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.
-
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results...
-
VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
Open-source text-to-video and image-to-video diffusion models generate high-quality 1024x576 videos, with the I2V variant claimed as the first to strictly preserve reference image content.
-
MVDream: Multi-view Diffusion for 3D Generation
MVDream is a multi-view diffusion model that functions as a generalizable 3D prior, enabling more consistent text-to-3D generation and few-shot 3D concept learning from 2D examples.
-
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
SDXL improves upon prior Stable Diffusion versions through a larger UNet backbone, dual text encoders, novel conditioning, and a refinement model, producing higher-fidelity images competitive with black-box state-of-t...
-
Training Diffusion Models with Reinforcement Learning
DDPO uses policy gradients on the denoising process to optimize diffusion models for arbitrary rewards like human feedback or compressibility.
-
Latent Video Diffusion Models for High-Fidelity Long Video Generation
Latent-space hierarchical diffusion models with targeted error-correction techniques generate realistic videos exceeding 1000 frames while using less compute than prior pixel-space approaches.
-
eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers
An ensemble of stage-specialized text-to-image diffusion models improves prompt alignment over single shared-parameter models while preserving visual quality and inference speed.
-
RealDiffusion: Physics-informed Attention for Multi-character Storybook Generation
RealDiffusion uses heat diffusion as a dissipative prior and a region-aware stochastic process inside a training-free physics-informed attention mechanism to improve multi-character coherence while preserving narrativ...
-
Diffusion-APO: Trajectory-Aware Direct Preference Alignment for Video Diffusion Transformers
Diffusion-APO synchronizes training noise with inference trajectories in video diffusion models to improve preference alignment and visual quality.
-
ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation
ST-Gen4D uses a world model that fuses global appearance and local dynamic graphs into a 4D cognition representation to guide consistent 4D Gaussian generation.
-
The Amazing Stability of Flow Matching
Flow matching generative models preserve sample quality, diversity, and latent representations despite pruning 50% of the CelebA-HQ dataset or altering architecture and training configurations.
-
Prompt-to-Gesture: Measuring the Capabilities of Image-to-Video Deictic Gesture Generation
Prompt-driven image-to-video generation produces deictic gestures that match real data visually, add useful variety, and improve downstream recognition models when mixed with human recordings.
-
DiffMagicFace: Identity Consistent Facial Editing of Real Videos
DiffMagicFace uses concurrent fine-tuned text and image diffusion models plus a rendered multi-view dataset to achieve identity-consistent text-conditioned editing of real facial videos.
-
Movie Gen: A Cast of Media Foundation Models
A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.
-
OpenWorldLib: A Unified Codebase and Definition of Advanced World Models
OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.
-
ModelScope Text-to-Video Technical Report
ModelScopeT2V is a 1.7-billion-parameter text-to-video model built on Stable Diffusion that adds temporal modeling and outperforms prior methods on three evaluation metrics.
-
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.
Reference graph
Works this paper leans on
-
[1]
Stochastic Variational Video Prediction
Mohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, Roy H. Campbell, and Sergey Levine. Stochastic Variational Video Prediction. arXiv:1710.11252. LAION-400M, https://laion.ai/blog/laion-400-open-dataset/.
-
[2]
On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of FAccT 2021.
-
[3]
Multimodal Datasets: Misogyny, Pornography, and Malignant Stereotypes
Abeba Birhane, Vinay Uday Prabhu, and Emmanuel Kahembwe. Multimodal Datasets: Misogyny, Pornography, and Malignant Stereotypes. arXiv:2110.01963.
-
[4]
CogView: Mastering Text-to-Image Generation via Transformers
CogView: Mastering Text-to-Image Generation via Transformers. arXiv:2105.13290. Chelsea Finn, Ian J. Goodfellow, and Sergey Levine. Unsupervised Learning for Physical Interaction Through Video Prediction. arXiv:1605.07157.
-
[5]
MaskViT: Masked Visual Pre-Training for Video Prediction
Agrim Gupta, Stephen Tian, Yunzhi Zhang, Jiajun Wu, Roberto Martín-Martín, and Li Fei-Fei. MaskViT: Masked Visual Pre-Training for Video Prediction. arXiv:2206.11894.
-
[6]
Flexible diffusion modeling of long videos
William Harvey, Saeid Naderiparizi, Vaden Masrani, Christian Weilbach, and Frank Wood. Flexible Diffusion Modeling of Long Videos. arXiv:2205.11495.
-
[7]
CLIPScore: A Reference-free Evaluation Metric for Image Captioning
Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A Reference-free Evaluation Metric for Image Captioning. arXiv:2104.08718.
-
[8]
GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. arXiv:1706.08500.
-
[9]
Classifier-free diffusion guidance
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications,
-
[10]
Cascaded Diffusion Models for High Fidelity Image Generation; Video Diffusion Models
Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded Diffusion Models for High Fidelity Image Generation. JMLR, 2022. Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet. Video Diffusion Models. arXiv:2204.03458, 2022. Nal Kalchbrenner, Aäron van den Oor...
-
[11]
Elucidating the Design Space of Diffusion-Based Generative Models
Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the Design Space of Diffusion-Based Generative Models. arXiv:2206.00364.
-
[12]
Variational Diffusion Models
Diederik P. Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational Diffusion Models. arXiv:2107.00630.
-
[13]
Deep Multi-Scale Video Prediction Beyond Mean Square Error
Michaël Mathieu, Camille Couprie, and Yann LeCun. Deep Multi-Scale Video Prediction Beyond Mean Square Error. arXiv:1511.05440.
-
[14]
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. arXiv:2112.10741.
-
[15]
Video (Language) Modeling: A Baseline for Generative Models of Natural Videos
Marc'Aurelio Ranzato, Arthur Szlam, Joan Bruna, Michaël Mathieu, Ronan Collobert, and Sumit Chopra. Video (Language) Modeling: A Baseline for Generative Models of Natural Videos. arXiv:1412.6604.
-
[16]
Make-A-Video: Text-to-Video Generation without Text-Video Data
Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-A-Video: Text-to-Video Generation without Text-Video Data. arXiv:2209.14792.
-
[17]
Denoising Diffusion Implicit Models
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising Diffusion Implicit Models. arXiv:2010.02502.
-
[18]
Neural Stochastic Differential Equations: Deep Latent Gaussian Models in the Diffusion Limit
Belinda Tzen and Maxim Raginsky. Neural Stochastic Differential Equations: Deep Latent Gaussian Models in the Diffusion Limit. arXiv:1905.09883.
-
[19]
FVD: A new Metric for Video Generation
Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. FVD: A New Metric for Video Generation. In ICLR 2019 Workshop: Deep Generative Models for Highly Structured Data.
-
[20]
Generating videos with scene dynamics
Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating Videos with Scene Dynamics. arXiv:1609.02612.
-
[21]
Diffusion Probabilistic Modeling for Video Generation
Ruihan Yang, Prakhar Srivastava, and Stephan Mandt. Diffusion Probabilistic Modeling for Video Generation. arXiv:2203.09481.
-
[22]
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling Autoregressive Models for Content-Rich Text-to-Image Generation. arXiv:2206.10789.