Imagen Video: High Definition Video Generation with Diffusion Models

Alexey Gritsenko, Chitwan Saharia, Jay Whang, Jonathan Ho, Ruiqi Gao, William Chan · 2022 · cs.CV · arXiv 2210.02303

Canonical reference. 96% of citing Pith papers cite this work as background.

120 Pith papers citing it

Background 96% of classified citations

abstract

We present Imagen Video, a text-conditional video generation system based on a cascade of video diffusion models. Given a text prompt, Imagen Video generates high definition videos using a base video generation model and a sequence of interleaved spatial and temporal video super-resolution models. We describe how we scale up the system as a high definition text-to-video model including design decisions such as the choice of fully-convolutional temporal and spatial super-resolution models at certain resolutions, and the choice of the v-parameterization of diffusion models. In addition, we confirm and transfer findings from previous work on diffusion-based image generation to the video generation setting. Finally, we apply progressive distillation to our video models with classifier-free guidance for fast, high quality sampling. We find Imagen Video not only capable of generating videos of high fidelity, but also having a high degree of controllability and world knowledge, including the ability to generate diverse videos and text animations in various artistic styles and with 3D object understanding. See https://imagen.research.google/video/ for samples.
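
The abstract names the v-parameterization as one of the design choices. As a rough illustration only (this is not code from the paper), the sketch below assumes a variance-preserving noising process z = alpha * x + sigma * eps with alpha^2 + sigma^2 = 1, under which the v target is commonly written as v = alpha * eps - sigma * x. The function names, the angle phi, and the toy array shape are hypothetical choices made for this example.

```python
import numpy as np


def v_target(x, eps, alpha, sigma):
    """Regression target under the v-parameterization: v = alpha*eps - sigma*x."""
    return alpha * eps - sigma * x


def x_from_v(z, v, alpha, sigma):
    """Recover the clean sample from noisy z and v (assumes alpha^2 + sigma^2 = 1)."""
    return alpha * z - sigma * v


def eps_from_v(z, v, alpha, sigma):
    """Recover the noise from noisy z and v (assumes alpha^2 + sigma^2 = 1)."""
    return sigma * z + alpha * v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy "video" tensor: (frames, height, width, channels); shape is arbitrary.
    x = rng.standard_normal((4, 8, 8, 3))
    eps = rng.standard_normal(x.shape)

    # Pick a noise level via an angle so that alpha^2 + sigma^2 = 1 holds exactly.
    phi = 0.3 * np.pi
    alpha, sigma = np.cos(phi), np.sin(phi)

    z = alpha * x + sigma * eps          # noisy sample
    v = v_target(x, eps, alpha, sigma)   # what a v-prediction network would regress

    # Both the clean sample and the noise are recoverable from (z, v).
    print(np.allclose(x_from_v(z, v, alpha, sigma), x))
    print(np.allclose(eps_from_v(z, v, alpha, sigma), eps))
```

Under these assumptions a network that predicts v can be converted to a prediction of either x or eps at any noise level, which is why the choice of parameterization affects the training target rather than what can be sampled.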

citation-role summary

background 25 baseline 1

citation-polarity summary

background 25 baseline 1

representative citing papers

Quotient-Space Diffusion Models

cs.LG · 2026-04-23 · unverdicted · novelty 8.0

Quotient-space diffusion models generate correct symmetric distributions by removing redundancy on the quotient space, simplifying learning and improving results on small molecules and proteins under SE(3) symmetry.

Consistency Models

cs.LG · 2023-03-02 · conditional · novelty 8.0

Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.

SafeGen-Bench: Benchmarking Safety in Image-Conditioned Text-to-Video Generation

cs.CV · 2026-05-31 · unverdicted · novelty 7.0

SafeGen-Bench is a benchmark with 10 malicious categories that evaluates conditional T2V models on paired start frames and text prompts, finding unsafety scores up to 44.5 and 80% guardrail failure rate.

MBench: A Comprehensive Benchmark on Memory Capability for Video World Models

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

MBench is a new benchmark that quantifies long-term memory in video world models via three hierarchical consistency dimensions evaluated on curated real videos.

Where to Refine, When to Stop: Rethinking Redundancy via Latent Discrepancy for Efficient Visual Autoregressive Generation

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

LD-Pruning applies latent discrepancy to prune tokens and adaptively skip unconditional branches in VAR models for up to 2.35x faster inference with preserved quality.

CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

CRONOS benchmark shows recent open-source video generators fail to preserve physical consistency under controlled changes to viewpoint, scene, object category, and appearance.

CoMoGen: COntrollable MOtion Dynamics and Interactions with Mask-Guided Video GENeration

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

CoMoGen generates controllable interactive video from mask sequences and images by encoding masks into MMDiT via MaskAdapter and LoRA on motion layers, claiming SOTA motion fidelity.

Functionalization via Structure Completion and Motion Rectification

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture with a new paired dataset.

StreamingEffect: Real-Time Human-Centric Video Effect Generation

cs.CV · 2026-05-16 · unverdicted · novelty 7.0

StreamingEffect enables real-time 720p human-centric video effect generation on one GPU via teacher-student distillation, keyframe control, and a new 130K video dataset.

Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation

cs.CV · 2026-05-15 · unverdicted · novelty 7.0

Echo-Forcing decouples stable anchors, compressed history, and recent dynamics in video diffusion KV caches using hierarchical memory, scene recall frames, and difference-aware decay to support interactive long video generation under bounded cache.

WorldVLN: Autoregressive World Action Model for Aerial Vision-Language Navigation

cs.RO · 2026-05-15 · unverdicted · novelty 7.0

WorldVLN proposes the first autoregressive world action model for aerial vision-language navigation that predicts short-horizon latent world states, decodes them to waypoints in closed loop, and uses two-stage training with Action-aware GRPO to achieve over 12% success-rate gains on benchmarks plus zero…

Training-Free Generative Sampling via Moment-Matched Score Smoothing

stat.ML · 2026-05-14 · unverdicted · novelty 7.0

MM-SOLD is a training-free particle sampler whose large-particle limit converges to a moment-matched Gibbs distribution obtained by exponentially tilting a score-smoothed target.

TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

TeDiO regularizes temporal diagonals in diffusion transformer attention maps to produce smoother video motion while keeping per-frame quality intact.

Covariance-aware sampling for Diffusion Models

stat.ML · 2026-05-13 · conditional · novelty 7.0

A covariance-aware extension of DDIM sampling for pixel-space diffusion models that uses Tweedie's formula and Fourier decomposition to model reverse-process covariance and improves sample quality at low NFE.

Dynamic Full-body Motion Agent with Object Interaction via Blending Pre-trained Modular Controllers

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

A two-stage framework augments HOI data with dynamic priors and blends pre-trained dynamic motion and static interaction agents via a composer network to enable long-term dynamic human-object interactions with higher success rates and reduced training time.

DCR: Counterfactual Attractor Guidance for Rare Compositional Generation

cs.CV · 2026-05-07 · unverdicted · novelty 7.0

DCR uses a counterfactual attractor and projection-based repulsion to suppress default completion bias in diffusion models, improving fidelity for rare compositional prompts while preserving quality.

Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation

cs.GR · 2026-04-28 · unverdicted · novelty 7.0

Cutscene Agent uses a multi-agent LLM system and a new toolkit for game engine control to automate end-to-end 3D cutscene generation, evaluated on the introduced CutsceneBench.

Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation

cs.CV · 2026-04-23 · unverdicted · novelty 7.0

Sparse Forcing adds a native trainable sparsity mechanism and PBSA kernel to autoregressive diffusion video models, yielding higher VBench scores and 1.1-1.27x speedups on 5s to 1min generations.

Immune2V: Image Immunization Against Dual-Stream Image-to-Video Generation

cs.CV · 2026-04-12 · unverdicted · novelty 7.0

Immune2V immunizes images against dual-stream I2V generation by enforcing temporally balanced latent divergence and aligning generative features to a precomputed collapse trajectory, yielding stronger persistent degradation than image-level baselines.

Training-Free Refinement of Flow Matching with Divergence-based Sampling

cs.CV · 2026-04-06 · unverdicted · novelty 7.0

Flow Divergence Sampler refines flow matching by computing velocity field divergence to correct ambiguous intermediate states during inference, improving fidelity in text-to-image and inverse problem tasks.

OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text

cs.SD · 2026-04-06 · unverdicted · novelty 7.0

OmniSonic introduces a TriAttn-DiT architecture with MoE gating to jointly generate on-screen, off-screen, and speech audio from video and text, outperforming prior models on a new UniHAGen-Bench.

Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation

cs.CV · 2026-04-03 · conditional · novelty 7.0

SCOPE accelerates autoregressive video diffusion up to 4.73x by using a tri-modal cache-predict-recompute scheduler with Taylor extrapolation and selective active-frame computation while preserving output quality.

ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation

cs.CV · 2026-03-18 · unverdicted · novelty 7.0

ChopGrad truncates backpropagation to local frame windows in video diffusion models, reducing memory from linear in frame count to constant while enabling pixel-wise loss fine-tuning.

Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models

cs.CV · 2026-01-07 · unverdicted · novelty 7.0 · 2 refs

LocalDPO aligns text-to-video diffusion models with human preferences at the spatio-temporal region level by automatically generating localized preference pairs from corrupted real videos and applying a region-aware DPO loss.

citing papers explorer

Showing 50 of 120 citing papers.

Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation? cs.RO · 2026-04-06 · unverdicted · none · ref 21 · internal anchor
Veo-3 video predictions enable approximate task-level robot trajectories in zero-shot settings but require hierarchical integration with low-level VLA policies for reliable manipulation performance.
Lifting Unlabeled Internet-level Data for 3D Scene Understanding cs.CV · 2026-04-02 · unverdicted · none · ref 44 · internal anchor
Unlabeled web videos processed by designed data engines generate effective training data that yields strong zero-shot and finetuned performance on 3D detection, segmentation, VQA, and navigation.
Unleashing the Potential of Diffusion Models for End-to-End Autonomous Driving cs.RO · 2026-02-26 · unverdicted · none · ref 18 · internal anchor
The paper introduces Hyper Diffusion Planner (HDP), a diffusion-based E2E AD framework that identifies insights on loss space, trajectory representation and data scaling, adds RL post-training, and reports 10x performance gains over 200 km of real-world testing across 6 scenarios.
Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion cs.CV · 2026-02-08 · unverdicted · none · ref 34 · internal anchor
Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.
DriveLaW: Unifying Planning and Video Generation in a Latent Driving World cs.CV · 2025-12-29 · unverdicted · none · ref 28 · internal anchor
DriveLaW unifies video world modeling and trajectory planning by injecting video-generator latents into a diffusion planner, achieving SOTA video prediction and a new record on the NAVSIM planning benchmark.
LangPrecip: Language-Aware Multimodal Precipitation Nowcasting cs.LG · 2025-12-26 · unverdicted · none · ref 6 · internal anchor
LangPrecip treats weather text as semantic motion constraints in a rectified-flow trajectory generator to improve multimodal precipitation nowcasting, yielding over 60% and 19% gains in heavy-rain CSI at 80-minute lead times on Swedish and MRMS data.
Splatent: Splatting Diffusion Latents for Novel View Synthesis cs.CV · 2025-12-10 · conditional · none · ref 17 · internal anchor
Splatent recovers fine details for latent-space 3D Gaussian Splatting by applying multi-view attention in 2D rather than reconstructing in 3D space.
Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation cs.CV · 2025-12-04 · conditional · none · ref 26 · internal anchor
Reward Forcing combines EMA-Sink tokens and Rewarded Distribution Matching Distillation to deliver state-of-the-art streaming video generation at 23.1 FPS without copying initial frames.
SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control cs.RO · 2025-11-11 · unverdicted · none · ref 19 · internal anchor
Scaling motion tracking models along size, data volume, and compute produces a foundation model for natural, robust humanoid whole-body control with downstream uses in kinematic planning and vision-language-action models.
BadGraph: A Backdoor Attack Against Latent Diffusion Model for Text-Guided Graph Generation cs.LG · 2025-10-23 · conditional · none · ref 11 · internal anchor
BadGraph poisons training data with textual triggers to implant backdoors in latent diffusion models for text-guided graph generation, achieving 50% attack success rate at under 10% poisoning and over 80% at 24% poisoning with negligible clean performance loss.
Are Video Models Emerging as Zero-Shot Learners and Reasoners in Medical Imaging? cs.CV · 2025-10-11 · unverdicted · none · ref 11 · internal anchor
A video-trained large vision model achieves competitive zero-shot performance on organ segmentation, denoising, super-resolution, and 4D CT motion prediction in medical imaging, outperforming some specialized baselines on patient data from 122 cases.
Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency cs.CV · 2025-10-09 · conditional · none · ref 9 · internal anchor
The work introduces rCM, a score-regularized continuous-time consistency model that matches DMD2 quality on large models up to 14B parameters while improving diversity and enabling 1-4 step sampling.
Self-Forcing++: Towards Minute-Scale High-Quality Video Generation cs.CV · 2025-10-02 · conditional · none · ref 17 · internal anchor
Self-Forcing++ scales autoregressive video diffusion to over 4 minutes by using self-generated segments for guidance, reducing error accumulation and outperforming baselines in fidelity and consistency.
Rolling Forcing: Autoregressive Long Video Diffusion in Real Time cs.CV · 2025-09-29 · unverdicted · none · ref 67 · internal anchor
Rolling Forcing generates multi-minute videos in real time by jointly denoising frames at increasing noise levels, anchoring attention to early frames, and using windowed distillation to limit error accumulation.
Enhancing Physical Plausibility in Video Generation by Reasoning the Implausibility cs.CV · 2025-09-29 · unverdicted · none · ref 12 · internal anchor
A training-free framework uses physics-violating counterfactual prompts and Synchronized Decoupled Guidance to suppress implausible motions in diffusion-based video generation while preserving photorealism.
DreamAudio: Customized Text-to-Audio Generation with Diffusion Models cs.SD · 2025-09-07 · unverdicted · none · ref 39 · internal anchor
DreamAudio generates audio clips that incorporate user-specified personalized audio events from reference samples while remaining aligned with text prompts.
Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion cs.CV · 2025-06-09 · unverdicted · none · ref 25 · internal anchor
Self Forcing trains autoregressive video diffusion models by performing autoregressive rollout with KV caching during training to close the exposure bias gap, using a holistic video-level loss and few-step diffusion for efficiency.
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness cs.CV · 2025-03-27 · accept · none · ref 44 · internal anchor
VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs, and anomaly detection methods.
Unified Video Action Model cs.RO · 2025-02-28 · unverdicted · none · ref 23 · internal anchor
UVA learns a joint video-action latent representation with decoupled diffusion decoding heads, enabling a single model to perform accurate fast policy learning, forward/inverse dynamics, and video generation without performance loss versus task-specific methods.
LTX-Video: Realtime Video Latent Diffusion cs.CV · 2024-12-30 · conditional · none · ref 27 · internal anchor
LTX-Video integrates Video-VAE and transformer for 1:192 latent compression and real-time video diffusion by moving patchifying to the VAE and letting the decoder finish denoising in pixel space.
VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation cs.CV · 2024-12-30 · unverdicted · none · ref 8 · internal anchor
VisionReward learns multi-dimensional human preferences for image and video generation via hierarchical assessment and linear weighting, outperforming VideoScore by 17.2% in prediction accuracy and yielding 31.6% higher win rates in text-to-video models.
Multimodal Diffusion Transformer with Memory Bank for Scalable Long-Duration Talking Video Generation cs.CV · 2024-11-24 · unverdicted · none · ref 30 · internal anchor
LetsTalk combines a multimodal diffusion transformer, noise-regularized memory bank, deep compression autoencoder, and symbiotic/direct fusion schemes to achieve state-of-the-art quality and efficiency in long-duration talking video generation.
How Far is Video Generation from World Model: A Physical Law Perspective cs.CV · 2024-11-04 · conditional · none · ref 6 · internal anchor
Video generation models generalize perfectly inside the training distribution but fail out-of-distribution and rely on case-based mimicking of nearest training examples instead of abstracting physical laws.
Diffusion Policy Policy Optimization cs.RO · 2024-09-01 · unverdicted · none · ref 39 · internal anchor
DPPO fine-tunes diffusion policies via policy gradients and outperforms prior RL approaches for diffusion policies and PG-tuned alternatives on robot benchmarks while enabling stable training and hardware deployment.
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer cs.CV · 2024-08-12 · unverdicted · none · ref 79 · internal anchor
CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.
CAT3D: Create Anything in 3D with Multi-View Diffusion Models cs.CV · 2024-05-16 · conditional · none · ref 49 · internal anchor
A multi-view diffusion model generates consistent novel views from sparse images to enable fast 3D scene reconstruction.
CameraCtrl: Enabling Camera Control for Text-to-Video Generation cs.CV · 2024-04-02 · unverdicted · none · ref 118 · internal anchor
CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.
VideoPoet: A Large Language Model for Zero-Shot Video Generation cs.CV · 2023-12-21 · unverdicted · none · ref 19 · internal anchor
VideoPoet is a large language model that performs zero-shot video generation with audio from diverse multimodal conditioning signals.
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets cs.CV · 2023-11-25 · conditional · none · ref 41 · internal anchor
Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results while enabling motion LoRA and multi-view 3D applications.
VideoCrafter1: Open Diffusion Models for High-Quality Video Generation cs.CV · 2023-10-30 · unverdicted · none · ref 27 · internal anchor
Open-source text-to-video and image-to-video diffusion models generate high-quality 1024x576 videos, with the I2V variant claimed as the first to strictly preserve reference image content.
Directly Fine-Tuning Diffusion Models on Differentiable Rewards cs.CV · 2023-09-29 · conditional · none · ref 10 · internal anchor
DRaFT fine-tunes diffusion models by differentiating through sampling to maximize rewards, outperforming RL baselines and improving aesthetics on Stable Diffusion 1.4.
MVDream: Multi-view Diffusion for 3D Generation cs.CV · 2023-08-31 · conditional · none · ref 170 · internal anchor
MVDream is a multi-view diffusion model that functions as a generalizable 3D prior, enabling more consistent text-to-3D generation and few-shot 3D concept learning from 2D examples.
DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory cs.CV · 2023-08-16 · unverdicted · none · ref 10 · internal anchor
DragNUWA integrates text, image, and trajectory controls into a diffusion video model using a Trajectory Sampler, Multiscale Fusion, and Adaptive Training to enable fine-grained open-domain video generation.
TokenFlow: Consistent Diffusion Features for Consistent Video Editing cs.CV · 2023-07-19 · conditional · none · ref 8 · internal anchor
TokenFlow produces consistent text-driven video edits by propagating diffusion features according to inter-frame correspondences extracted from the source video.
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis cs.CV · 2023-07-04 · conditional · none · ref 15 · internal anchor
SDXL improves upon prior Stable Diffusion versions through a larger UNet backbone, dual text encoders, novel conditioning, and a refinement model, producing higher-fidelity images competitive with black-box state-of-the-art generators.
Training Diffusion Models with Reinforcement Learning cs.LG · 2023-05-22 · unverdicted · none · ref 25 · internal anchor
DDPO uses policy gradients on the denoising process to optimize diffusion models for arbitrary rewards like human feedback or compressibility.
Shap-E: Generating Conditional 3D Implicit Functions cs.CV · 2023-05-03 · accept · none · ref 23 · internal anchor
Shap-E encodes 3D assets into implicit function parameters then uses a conditional diffusion model to generate new ones from text, enabling fast multi-representation 3D asset creation.
Scaling Robot Learning with Semantically Imagined Experience cs.RO · 2023-02-22 · unverdicted · none · ref 74 · internal anchor
Augmenting robot datasets via diffusion-based semantic inpainting enables manipulation policies to solve unseen tasks with new objects and improves robustness to novel distractors.
Latent Video Diffusion Models for High-Fidelity Long Video Generation cs.CV · 2022-11-23 · unverdicted · none · ref 8 · internal anchor
Latent-space hierarchical diffusion models with targeted error-correction techniques generate realistic videos exceeding 1000 frames while using less compute than prior pixel-space approaches.
eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers cs.CV · 2022-11-02 · unverdicted · none · ref 25 · internal anchor
An ensemble of stage-specialized text-to-image diffusion models improves prompt alignment over single shared-parameter models while preserving visual quality and inference speed.
BioVid: Autoregressive Video Generation with Biological Behavior Semantic Comprehension cs.CV · 2026-06-07 · unverdicted · none · ref 3 · internal anchor
BioVid is a data-driven autoregressive model using 2D-encode/3D-decode tokenization and causal Transformer with EOS termination that reproduces real action duration distributions (W1 distance 1.24 frames) on NTU RGB+D drinking clips, outperforming fixed-length baselines.
Lagrangian Perturbation Diffusion Steering: Latent Reinforcement Learning for Generative Policies cs.LG · 2026-05-31 · unverdicted · none · ref 11 · internal anchor
LP-DS improves generative policies for imitation and RL by optimizing latent noise perturbations with a constrained Lagrangian objective, showing up to 25% better returns on manipulation and locomotion tasks.
From Prompts to Context: An Ontology-Driven Framework for Human-Generative AI Collaboration cs.HC · 2026-05-28 · unverdicted · none · ref 42 · internal anchor
Presents the CCAI ontology and SPARQL retrieval method to convert ephemeral Human-Generative AI prompt interactions into explicit, machine-readable collaboration traces, illustrated in a competency-profile software case study.
DisagFusion: Asynchronous Pipeline Parallelism and Elastic Scheduling for Disaggregated Diffusion Serving cs.DC · 2026-05-25 · unverdicted · none · ref 7 · internal anchor
DisagFusion achieves 3.4x-20.5x higher throughput and 18.5x lower latency for diffusion serving via asynchronous pipeline parallelism and elastic hybrid scheduling on disaggregated hardware.
One-Forcing: Towards Stable One-Step Autoregressive Video Generation cs.CV · 2026-05-22 · unverdicted · none · ref 22 · internal anchor
One-Forcing augments DMD with a GAN loss to enable stable one-step causal autoregressive video generation, reporting a VBench score of 83.76 as SOTA among one-step methods.
MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset cs.CV · 2026-05-20 · unverdicted · none · ref 34 · internal anchor
MONET is an open 104.9M image-text pair dataset created via safety filtering, deduplication, and multi-VLM recaptioning from 2.9B raw pairs, validated by training a competitive 4B-parameter latent diffusion model.
RealDiffusion: Physics-informed Attention for Multi-character Storybook Generation cs.CV · 2026-05-12 · unverdicted · none · ref 11 · internal anchor
RealDiffusion uses heat diffusion as a dissipative prior and a region-aware stochastic process inside a training-free physics-informed attention mechanism to improve multi-character coherence while preserving narrative dynamism in sequential image generation.
Diffusion-APO: Trajectory-Aware Direct Preference Alignment for Video Diffusion Transformers cs.CV · 2026-05-08 · unverdicted · none · ref 11 · internal anchor
Diffusion-APO synchronizes training noise with inference trajectories in video diffusion models to improve preference alignment and visual quality.
ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation cs.CV · 2026-05-08 · unverdicted · none · ref 1 · internal anchor
ST-Gen4D uses a world model that fuses global appearance and local dynamic graphs into a 4D cognition representation to guide consistent 4D Gaussian generation.
The Amazing Stability of Flow Matching cs.CV · 2026-04-17 · unverdicted · none · ref 14 · internal anchor
Flow matching generative models preserve sample quality, diversity, and latent representations despite pruning 50% of the CelebA-HQ dataset or altering architecture and training configurations.
