super hub Canonical reference

Imagen Video: High Definition Video Generation with Diffusion Models

Alexey Gritsenko, Chitwan Saharia, Jay Whang, Jonathan Ho, Ruiqi Gao, William Chan · 2022 · cs.CV · arXiv 2210.02303

Canonical reference. 96% of citing Pith papers cite this work as background.

134 Pith papers citing it

Background 96% of classified citations

open full Pith review browse 134 citing papers more from Alexey Gritsenko arXiv PDF

abstract

We present Imagen Video, a text-conditional video generation system based on a cascade of video diffusion models. Given a text prompt, Imagen Video generates high definition videos using a base video generation model and a sequence of interleaved spatial and temporal video super-resolution models. We describe how we scale up the system as a high definition text-to-video model including design decisions such as the choice of fully-convolutional temporal and spatial super-resolution models at certain resolutions, and the choice of the v-parameterization of diffusion models. In addition, we confirm and transfer findings from previous work on diffusion-based image generation to the video generation setting. Finally, we apply progressive distillation to our video models with classifier-free guidance for fast, high quality sampling. We find Imagen Video not only capable of generating videos of high fidelity, but also having a high degree of controllability and world knowledge, including the ability to generate diverse videos and text animations in various artistic styles and with 3D object understanding. See https://imagen.research.google/video/ for samples.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 25 baseline 1

citation-polarity summary

background 25 baseline 1

claims ledger

abstract We present Imagen Video, a text-conditional video generation system based on a cascade of video diffusion models. Given a text prompt, Imagen Video generates high definition videos using a base video generation model and a sequence of interleaved spatial and temporal video super-resolution models. We describe how we scale up the system as a high definition text-to-video model including design decisions such as the choice of fully-convolutional temporal and spatial super-resolution models at certain resolutions, and the choice of the v-parameterization of diffusion models. In addition, we confi

authors

Alexey Gritsenko Chitwan Saharia Jay Whang Jonathan Ho Ruiqi Gao William Chan

co-cited works

representative citing papers

Quotient-Space Diffusion Models

cs.LG · 2026-04-23 · unverdicted · novelty 8.0

Quotient-space diffusion models generate correct symmetric distributions by removing redundancy on the quotient space, simplifying learning and improving results on small molecules and proteins under SE(3) symmetry.

Consistency Models

cs.LG · 2023-03-02 · conditional · novelty 8.0

Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.

DVG-WM: Disentangled Video Generation Enables Efficient Embodied World Model for Robotic Manipulation

cs.RO · 2026-06-30 · unverdicted · novelty 7.0

DVG-WM disentangles dynamics learning and visual synthesis in video world models using flow matching and latent degradation to achieve faster inference up to 3.97 times with improved quality on LIBERO and real-world robotic platforms.

Physics in 2-Steps: Locking Motion Priors Before Visual Refinement Erases Them

cs.CV · 2026-06-04 · unverdicted · novelty 7.0

PhaseLock extracts motion priors from 2-step inference and enforces them via Latent Delta Guidance to raise physical consistency scores by 6.2 points on average in image-to-video diffusion models.

Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

Dream.exe evaluates 8 video generation models on 101 manipulation tasks by converting generated videos into executable robot trajectories in a simulator, finding measurable success rates that visual metrics do not predict.

Inference-Time Scaling for Joint Audio-Video Generation

cs.MM · 2026-06-02 · unverdicted · novelty 7.0

Presents multi-verifier framework and Adaptive Reward Weighting (ARW) for inference-time scaling in joint audio-video generation, reporting gains in alignment and synchronization on VGGSound and JavisBench-mini.

Why Are DMD Students Lazy? Understanding the Copying Behavior in Few-Step Distillation

cs.LG · 2026-06-01 · unverdicted · novelty 7.0

In high-dimensional DMD distillation, student models spontaneously copy teacher noise-data pairings as an emergent effect of limited geometric freedom rather than adversarial objectives or memorization.

SafeGen-Bench: Benchmarking Safety in Image-Conditioned Text-to-Video Generation

cs.CV · 2026-05-31 · unverdicted · novelty 7.0

SafeGen-Bench is a benchmark with 10 malicious categories that evaluates conditional T2V models on paired start frames and text prompts, finding unsafety scores up to 44.5 and 80% guardrail failure rate.

Knowledge-Intensive Video Generation

cs.CV · 2026-05-31 · unverdicted · novelty 7.0

KIVI-Bench and new metrics reveal that seven state-of-the-art text-to-video models lag humans on factual accuracy, procedural clarity, and information presentation.

MBench: A Comprehensive Benchmark on Memory Capability for Video World Models

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

MBench is a new benchmark that quantifies long-term memory in video world models via three hierarchical consistency dimensions evaluated on curated real videos.

Where to Refine, When to Stop: Rethinking Redundancy via Latent Discrepancy for Efficient Visual Autoregressive Generation

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

LD-Pruning applies latent discrepancy to prune tokens and adaptively skip unconditional branches in VAR models for up to 2.35x faster inference with preserved quality.

World Models as Group Actions

cs.CV · 2026-05-23 · unverdicted · novelty 7.0

Formalizes video world models as group actions on states and uses latent regularization with synthesized supervision to enforce consistency, introducing GAC and GAR metrics that improve structural correctness in SOTA models.

CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

CRONOS benchmark shows recent open-source video generators fail to preserve physical consistency under controlled changes to viewpoint, scene, object category, and appearance.

CoMoGen: COntrollable MOtion Dynamics and Interactions with Mask-Guided Video GENeration

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

CoMoGen generates controllable interactive video from mask sequences and images by encoding masks into MMDiT via MaskAdapter and LoRA on motion layers, claiming SOTA motion fidelity.

Functionalization via Structure Completion and Motion Rectification

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture with a new paired dataset.

StreamingEffect: Real-Time Human-Centric Video Effect Generation

cs.CV · 2026-05-16 · unverdicted · novelty 7.0

StreamingEffect enables real-time 720p human-centric video effect generation on one GPU via teacher-student distillation, keyframe control, and a new 130K video dataset.

Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation

cs.CV · 2026-05-15 · unverdicted · novelty 7.0

Echo-Forcing decouples stable anchors, compressed history, and recent dynamics in video diffusion KV caches using hierarchical memory, scene recall frames, and difference-aware decay to support interactive long video generation under bounded cache.

WorldVLN: Autoregressive World Action Model for Aerial Vision-Language Navigation

cs.RO · 2026-05-15 · unverdicted · novelty 7.0

WorldVLN proposes the first autoregressive world action model for aerial vision-language navigation that predicts short-horizon latent world states, decodes them to waypoints in closed loop, and uses two-stage training with Action-aware GRPO to achieve over 12% success-rate gains on benchmarks plus零

Training-Free Generative Sampling via Moment-Matched Score Smoothing

stat.ML · 2026-05-14 · unverdicted · novelty 7.0

MM-SOLD is a training-free particle sampler whose large-particle limit converges to a moment-matched Gibbs distribution obtained by exponentially tilting a score-smoothed target.

TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

TeDiO regularizes temporal diagonals in diffusion transformer attention maps to produce smoother video motion while keeping per-frame quality intact.

Covariance-aware sampling for Diffusion Models

stat.ML · 2026-05-13 · conditional · novelty 7.0

A covariance-aware extension of DDIM sampling for pixel-space diffusion models that uses Tweedie's formula and Fourier decomposition to model reverse-process covariance and improves sample quality at low NFE.

Dynamic Full-body Motion Agent with Object Interaction via Blending Pre-trained Modular Controllers

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

A two-stage framework augments HOI data with dynamic priors and blends pre-trained dynamic motion and static interaction agents via a composer network to enable long-term dynamic human-object interactions with higher success rates and reduced training time.

DCR: Counterfactual Attractor Guidance for Rare Compositional Generation

cs.CV · 2026-05-07 · unverdicted · novelty 7.0

DCR uses a counterfactual attractor and projection-based repulsion to suppress default completion bias in diffusion models, improving fidelity for rare compositional prompts while preserving quality.

Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation

cs.GR · 2026-04-28 · unverdicted · novelty 7.0

Cutscene Agent uses a multi-agent LLM system and a new toolkit for game engine control to automate end-to-end 3D cutscene generation, evaluated on the introduced CutsceneBench.

citing papers explorer

Showing 50 of 134 citing papers.

Quotient-Space Diffusion Models cs.LG · 2026-04-23 · unverdicted · none · ref 73 · internal anchor
Quotient-space diffusion models generate correct symmetric distributions by removing redundancy on the quotient space, simplifying learning and improving results on small molecules and proteins under SE(3) symmetry.
Consistency Models cs.LG · 2023-03-02 · conditional · none · ref 20 · internal anchor
Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.
DVG-WM: Disentangled Video Generation Enables Efficient Embodied World Model for Robotic Manipulation cs.RO · 2026-06-30 · unverdicted · none · ref 15 · internal anchor
DVG-WM disentangles dynamics learning and visual synthesis in video world models using flow matching and latent degradation to achieve faster inference up to 3.97 times with improved quality on LIBERO and real-world robotic platforms.
Physics in 2-Steps: Locking Motion Priors Before Visual Refinement Erases Them cs.CV · 2026-06-04 · unverdicted · none · ref 48 · internal anchor
PhaseLock extracts motion priors from 2-step inference and enforces them via Latent Delta Guidance to raise physical consistency scores by 6.2 points on average in image-to-video diffusion models.
Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation? cs.CV · 2026-06-03 · unverdicted · none · ref 28 · internal anchor
Dream.exe evaluates 8 video generation models on 101 manipulation tasks by converting generated videos into executable robot trajectories in a simulator, finding measurable success rates that visual metrics do not predict.
Inference-Time Scaling for Joint Audio-Video Generation cs.MM · 2026-06-02 · unverdicted · none · ref 4 · internal anchor
Presents multi-verifier framework and Adaptive Reward Weighting (ARW) for inference-time scaling in joint audio-video generation, reporting gains in alignment and synchronization on VGGSound and JavisBench-mini.
Why Are DMD Students Lazy? Understanding the Copying Behavior in Few-Step Distillation cs.LG · 2026-06-01 · unverdicted · none · ref 25 · internal anchor
In high-dimensional DMD distillation, student models spontaneously copy teacher noise-data pairings as an emergent effect of limited geometric freedom rather than adversarial objectives or memorization.
SafeGen-Bench: Benchmarking Safety in Image-Conditioned Text-to-Video Generation cs.CV · 2026-05-31 · unverdicted · none · ref 18 · internal anchor
SafeGen-Bench is a benchmark with 10 malicious categories that evaluates conditional T2V models on paired start frames and text prompts, finding unsafety scores up to 44.5 and 80% guardrail failure rate.
Knowledge-Intensive Video Generation cs.CV · 2026-05-31 · unverdicted · none · ref 3 · internal anchor
KIVI-Bench and new metrics reveal that seven state-of-the-art text-to-video models lag humans on factual accuracy, procedural clarity, and information presentation.
MBench: A Comprehensive Benchmark on Memory Capability for Video World Models cs.CV · 2026-05-30 · unverdicted · none · ref 29 · internal anchor
MBench is a new benchmark that quantifies long-term memory in video world models via three hierarchical consistency dimensions evaluated on curated real videos.
Where to Refine, When to Stop: Rethinking Redundancy via Latent Discrepancy for Efficient Visual Autoregressive Generation cs.CV · 2026-05-29 · unverdicted · none · ref 3 · internal anchor
LD-Pruning applies latent discrepancy to prune tokens and adaptively skip unconditional branches in VAR models for up to 2.35x faster inference with preserved quality.
World Models as Group Actions cs.CV · 2026-05-23 · unverdicted · none · ref 50 · internal anchor
Formalizes video world models as group actions on states and uses latent regularization with synthesized supervision to enforce consistency, introducing GAC and GAR metrics that improve structural correctness in SOTA models.
CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models cs.CV · 2026-05-22 · unverdicted · none · ref 22 · internal anchor
CRONOS benchmark shows recent open-source video generators fail to preserve physical consistency under controlled changes to viewpoint, scene, object category, and appearance.
CoMoGen: COntrollable MOtion Dynamics and Interactions with Mask-Guided Video GENeration cs.CV · 2026-05-21 · unverdicted · none · ref 22 · internal anchor
CoMoGen generates controllable interactive video from mask sequences and images by encoding masks into MMDiT via MaskAdapter and LoRA on motion layers, claiming SOTA motion fidelity.
Functionalization via Structure Completion and Motion Rectification cs.CV · 2026-05-18 · unverdicted · none · ref 113 · internal anchor
Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture with a new paired dataset.
StreamingEffect: Real-Time Human-Centric Video Effect Generation cs.CV · 2026-05-16 · unverdicted · none · ref 22 · internal anchor
StreamingEffect enables real-time 720p human-centric video effect generation on one GPU via teacher-student distillation, keyframe control, and a new 130K video dataset.
Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation cs.CV · 2026-05-15 · unverdicted · none · ref 7 · internal anchor
Echo-Forcing decouples stable anchors, compressed history, and recent dynamics in video diffusion KV caches using hierarchical memory, scene recall frames, and difference-aware decay to support interactive long video generation under bounded cache.
WorldVLN: Autoregressive World Action Model for Aerial Vision-Language Navigation cs.RO · 2026-05-15 · unverdicted · none · ref 15 · internal anchor
WorldVLN proposes the first autoregressive world action model for aerial vision-language navigation that predicts short-horizon latent world states, decodes them to waypoints in closed loop, and uses two-stage training with Action-aware GRPO to achieve over 12% success-rate gains on benchmarks plus零
Training-Free Generative Sampling via Moment-Matched Score Smoothing stat.ML · 2026-05-14 · unverdicted · none · ref 9 · internal anchor
MM-SOLD is a training-free particle sampler whose large-particle limit converges to a moment-matched Gibbs distribution obtained by exponentially tilting a score-smoothed target.
TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion cs.CV · 2026-05-13 · unverdicted · none · ref 20 · internal anchor
TeDiO regularizes temporal diagonals in diffusion transformer attention maps to produce smoother video motion while keeping per-frame quality intact.
Covariance-aware sampling for Diffusion Models stat.ML · 2026-05-13 · conditional · none · ref 8 · internal anchor
A covariance-aware extension of DDIM sampling for pixel-space diffusion models that uses Tweedie's formula and Fourier decomposition to model reverse-process covariance and improves sample quality at low NFE.
Dynamic Full-body Motion Agent with Object Interaction via Blending Pre-trained Modular Controllers cs.CV · 2026-05-12 · unverdicted · none · ref 16 · internal anchor
A two-stage framework augments HOI data with dynamic priors and blends pre-trained dynamic motion and static interaction agents via a composer network to enable long-term dynamic human-object interactions with higher success rates and reduced training time.
DCR: Counterfactual Attractor Guidance for Rare Compositional Generation cs.CV · 2026-05-07 · unverdicted · none · ref 16 · internal anchor
DCR uses a counterfactual attractor and projection-based repulsion to suppress default completion bias in diffusion models, improving fidelity for rare compositional prompts while preserving quality.
Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation cs.GR · 2026-04-28 · unverdicted · none · ref 4 · internal anchor
Cutscene Agent uses a multi-agent LLM system and a new toolkit for game engine control to automate end-to-end 3D cutscene generation, evaluated on the introduced CutsceneBench.
Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation cs.CV · 2026-04-23 · unverdicted · none · ref 8 · internal anchor
Sparse Forcing adds a native trainable sparsity mechanism and PBSA kernel to autoregressive diffusion video models, yielding higher VBench scores and 1.1-1.27x speedups on 5s to 1min generations.
Immune2V: Image Immunization Against Dual-Stream Image-to-Video Generation cs.CV · 2026-04-12 · unverdicted · none · ref 3 · internal anchor
Immune2V immunizes images against dual-stream I2V generation by enforcing temporally balanced latent divergence and aligning generative features to a precomputed collapse trajectory, yielding stronger persistent degradation than image-level baselines.
Training-Free Refinement of Flow Matching with Divergence-based Sampling cs.CV · 2026-04-06 · unverdicted · none · ref 11 · internal anchor
Flow Divergence Sampler refines flow matching by computing velocity field divergence to correct ambiguous intermediate states during inference, improving fidelity in text-to-image and inverse problem tasks.
OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text cs.SD · 2026-04-06 · unverdicted · none · ref 11 · internal anchor
OmniSonic introduces a TriAttn-DiT architecture with MoE gating to jointly generate on-screen, off-screen, and speech audio from video and text, outperforming prior models on a new UniHAGen-Bench.
Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation cs.CV · 2026-04-03 · conditional · none · ref 18 · internal anchor
SCOPE accelerates autoregressive video diffusion up to 4.73x by using a tri-modal cache-predict-recompute scheduler with Taylor extrapolation and selective active-frame computation while preserving output quality.
ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation cs.CV · 2026-03-18 · unverdicted · none · ref 24 · internal anchor
ChopGrad truncates backpropagation to local frame windows in video diffusion models, reducing memory from linear in frame count to constant while enabling pixel-wise loss fine-tuning.
Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models cs.CV · 2026-01-07 · unverdicted · none · ref 18 · 2 links · internal anchor
LocalDPO aligns text-to-video diffusion models with human preferences at the spatio-temporal region level by automatically generating localized preference pairs from corrupted real videos and applying a region-aware DPO loss.
Large Video Planner Enables Generalizable Robot Control cs.RO · 2025-12-17 · conditional · none · ref 38 · internal anchor
A video foundation model trained on human demonstrations generates zero-shot plans that convert to executable robot actions on novel scenes and tasks.
ASTRA: Let Arbitrary Subjects Transform in Video Editing cs.CV · 2025-10-01 · unverdicted · none · ref 10 · internal anchor
ASTRA is a plug-and-play training-free method for precise multi-subject video editing that uses prompt-guided multimodal alignment and prior-based mask retargeting to avoid attention dilution and boundary issues.
Beyond the Frame: Generating 360 Panoramic Videos from Perspective Videos cs.CV · 2025-04-10 · unverdicted · none · ref 22 · internal anchor
A generative model produces realistic and coherent 360 panoramic videos from in-the-wild perspective videos via curated online data and geometry-motion aware operations.
History-Guided Video Diffusion cs.LG · 2025-02-10 · unverdicted · none · ref 24 · internal anchor
DFoT enables flexible history conditioning in video diffusion, with history guidance methods that boost temporal consistency and support long rollouts.
Diffusion Models Are Real-Time Game Engines cs.LG · 2024-08-27 · conditional · none · ref 69 · internal anchor
A diffusion model trained on DOOM play sessions generates stable real-time interactive game frames at 20 FPS with quality near lossy JPEG.
RoboDreamer: Learning Compositional World Models for Robot Imagination cs.RO · 2024-04-18 · unverdicted · none · ref 71 · internal anchor
RoboDreamer factorizes video generation using language primitives to achieve compositional generalization in robot world models, outperforming monolithic baselines on unseen goals in RT-X.
Learning Interactive Real-World Simulators cs.AI · 2023-10-09 · conditional · none · ref 220 · internal anchor
UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation cs.CV · 2023-10-09 · unverdicted · none · ref 289 · internal anchor
A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning cs.CV · 2023-07-10 · unverdicted · none · ref 8 · internal anchor
A single motion module trained on videos adds temporally coherent animation to any personalized text-to-image model derived from the same base without additional tuning.
EMOSH: Expressive Motion and Shape Disentanglement for Human Animation cs.CV · 2026-06-26 · unverdicted · none · ref 13 · internal anchor
EMOSH proposes an Expressive Human Model with disentangled parameters, coarse-to-fine motion injection, and spatially-aligned conditioning to generate high-fidelity expressive human videos without driving-subject shape leakage.
Streaming Video Generation with Streaming Force Control cs.CV · 2026-06-05 · unverdicted · none · ref 21 · internal anchor
StreamForce presents a unified causal model for force-controllable streaming video generation using a new force representation and distillation pipeline, claiming SOTA force adherence and 16.6 FPS performance.
Time-Aware Diffusion based on Preference Disentanglement for Generative Recommendation cs.IR · 2026-06-01 · unverdicted · none · ref 9 · internal anchor
TDPM is a diffusion-based generative recommender that disentangles user preferences into period and point components to enable time-aware diffusion on semantic indices, reporting up to 29% gains on HR@20 and NDCG@20 over baselines on three datasets.
PAI-Studio: Cinematic Video Background Replacement with Camera-Aware Motion cs.CV · 2026-05-31 · unverdicted · none · ref 16 · internal anchor
PAI-Studio reformulates cinematic background replacement as in-context conditional generation inside a Diffusion Transformer with bidirectional attention, trained on a new 30K film-sourced dataset, and reports better motion consistency and relighting than prior open-source and commercial systems.
Cross-Axis Feature Fusion with Joint-Wise Motion Difference Prediction for Text-Based 3D Human Motion Editing cs.CV · 2026-05-31 · unverdicted · none · ref 17 · internal anchor
Introduces cross-axis feature fusion transformers and a joint-wise Soft-DTW regression auxiliary task for text-based 3D human motion editing, reporting state-of-the-art results on the MotionFix dataset.
T-CLIP: Enabling Thermal Perception for Contrastive Language-Image Pretraining cs.CV · 2026-05-30 · unverdicted · none · ref 46 · internal anchor
T-CLIP introduces a physics-aware thermal captioning dataset (IR-Cap) and a decoupled dual-LoRA adaptation of CLIP that improves cross-modal retrieval on thermal benchmarks by separating scene-level and object-level thermal understanding.
A Systematic Study of Behavioral Cloning for Scientific Data Annotation cs.HC · 2026-05-26 · unverdicted · none · ref 278 · internal anchor
Introduces 9 synthetic annotation tasks and benchmarks for behavioral cloning, finding hierarchical skill learning, scaling benefits, effective multi-task pretraining, and shared internal representations of task phases and mistakes.
Rebalancing Reference Frame Dominance to Improve Motion in Image-to-Video Models cs.CV · 2026-05-19 · unverdicted · none · ref 15 · 3 links · internal anchor
Reference-frame dominance in self-attention suppresses motion in image-to-video models; DyMoS rebalances attention from generated frames to the reference during initial denoising steps to improve dynamics while preserving fidelity.
Do Less, Achieve More: Do We Need Every-Step Optimization for RL Fine-tuning of Diffusion Models? cs.CV · 2026-05-15 · unverdicted · none · ref 16 · internal anchor
AdaScope adaptively selects optimal RL intervention points during diffusion denoising by monitoring structural and semantic changes, delivering 66% higher performance at 59% lower cost than full-trajectory RL baselines.
See Before You Code: Learning Visual Priors for Spatially Aware Educational Animation Generation cs.AI · 2026-05-15 · unverdicted · none · ref 2 · internal anchor
OmniManim improves render quality in educational animation code generation by using a Vision Agent with coarse-to-fine bounding-box denoising and interpolation-aware optimization on new datasets.

Imagen Video: High Definition Video Generation with Diffusion Models

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer