Towards Accurate Generative Models of Video: A New Metric & Challenges
56 papers cite this work.
Abstract
Recent advances in deep generative models have led to remarkable progress in synthesizing high-quality images. Following their successful application in image processing and representation learning, an important next step is to consider videos. Learning generative models of video is a much harder task, requiring a model to capture the temporal dynamics of a scene in addition to the visual presentation of objects. While recent attempts at formulating generative models of video have had some success, current progress is hampered by (1) the lack of qualitative metrics that consider visual quality, temporal coherence, and diversity of samples, and (2) the wide gap between purely synthetic video data sets and challenging real-world data sets in terms of complexity. To this end we propose Fréchet Video Distance (FVD), a new metric for generative models of video, and StarCraft 2 Videos (SCV), a benchmark of gameplay from custom StarCraft 2 scenarios that challenge the current capabilities of generative models of video. We contribute a large-scale human study, which confirms that FVD correlates well with qualitative human judgment of generated videos, and provide initial benchmark results on SCV.
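For reference, FVD is the Fréchet (2-Wasserstein) distance between Gaussians fitted to feature activations of real and generated video clips; the paper extracts these features with an I3D network pretrained on Kinetics. The snippet below is a minimal, illustrative sketch of that distance computation given two already-extracted feature matrices; the function name and array shapes are assumptions, and the feature-extraction step itself is not shown.

```python
import numpy as np
from scipy import linalg


def frechet_video_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two video-feature sets.

    Illustrative sketch: `feats_real` and `feats_gen` have shape
    (num_clips, feature_dim), e.g. pooled activations of a pretrained
    video network (the paper uses an I3D network trained on Kinetics).
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)

    # Matrix square root of the covariance product; discard the tiny
    # imaginary components that arise from numerical error.
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_r - mu_g
    # d^2 = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2})
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```

Lower values indicate that the generated clips are closer, in distribution, to the real ones. Both sets should be embedded with the same feature network, and since the estimate is sensitive to sample size, real and generated sides should use comparable numbers of clips.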
Citing papers explorer
-
PhysInOne: Visual Physics Learning and Reasoning in One Suite
PhysInOne is a new dataset of 2 million videos across 153,810 dynamic 3D scenes covering 71 physical phenomena, shown to improve AI performance on physics-aware video generation, prediction, property estimation, and motion transfer.
-
GaitProtector: Impersonation-Driven Gait De-Identification via Training-Free Diffusion Latent Optimization
GaitProtector optimizes diffusion model latents to impersonate target identities in gait sequences, dropping Rank-1 identification accuracy from 89.6% to 15.0% on CASIA-B while keeping scoliosis diagnostic accuracy at 74.2%.
-
$h$-control: Training-Free Camera Control via Block-Conditional Gibbs Refinement
h-control introduces block-conditional pseudo-Gibbs refinement for training-free camera control in flow-matching video generators, achieving superior FVD scores on RealEstate10K and DAVIS benchmarks.
-
Is Your Driving World Model an All-Around Player?
The WorldLens benchmark reveals that no driving world model dominates across visual, geometric, behavioral, and perceptual fidelity; it also contributes a 26K human-annotated dataset and a distilled vision-language evaluator.
-
ConFixGS: Learning to Fix Feedforward 3D Gaussian Splatting with Confidence-Aware Diffusion Priors in Driving Scenes
ConFixGS repairs feedforward 3D Gaussian Splatting with confidence-aware diffusion priors, delivering up to 3.68 dB PSNR gains and halved FID scores on Waymo, nuScenes, and KITTI novel view synthesis tasks.
-
One World, Dual Timeline: Decoupled Spatio-Temporal Gaussian Scene Graph for 4D Cooperative Driving Reconstruction
DUST decouples pose trajectories per camera source while sharing canonical Gaussians per agent to remove cross-source gradient conflicts and ghosting caused by temporal asynchrony in 4D cooperative driving scenes.
-
Do Joint Audio-Video Generation Models Understand Physics?
Current joint audio-video generation models lack robust physical commonsense, especially during transitions and when prompted for impossible behaviors.
-
AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics
AniMatrix generates anime videos by structuring artistic production rules into a controllable taxonomy and training the model to prioritize those rules over physical realism, achieving top scores from professional animators on prompt understanding and artistic motion.
-
Active Sampling for Ultra-Low-Bit-Rate Video Compression via Conditional Controlled Diffusion
ActDiff-VC achieves up to 64.6% bitrate reduction at matched NIQE and improves perceptual metrics like KID and FID by using content-adaptive keyframe selection and budget-aware sparse trajectory selection to condition a diffusion decoder for ultra-low-bitrate video reconstruction.
-
ABC: Any-Subset Autoregression via Non-Markovian Diffusion Bridges in Continuous Time and Space
ABC enables any-subset autoregressive generation of continuous stochastic processes via non-Markovian diffusion bridges that track physical time and allow path-dependent conditioning.
-
Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling
Talker-T2AV achieves better lip-sync accuracy, video quality, and audio quality than dual-branch baselines by separating high-level shared autoregressive modeling from modality-specific low-level diffusion refinement in a joint audio-video generation framework.
-
OccDirector: Language-Guided Behavior and Interaction Generation in 4D Occupancy Space
OccDirector uses a VLM-guided Spatio-Temporal MMDiT model with history anchoring to generate physically plausible 4D occupancy from language scripts, supported by the new OccInteract-85k dataset.
-
WorldMark: A Unified Benchmark Suite for Interactive Video World Models
WorldMark is the first public benchmark that standardizes scenes, trajectories, and control interfaces across heterogeneous interactive image-to-video world models.
-
HumanScore: Benchmarking Human Motions in Generated Videos
HumanScore defines six metrics for kinematic plausibility, temporal stability, and biomechanical consistency to benchmark human motions in videos from thirteen state-of-the-art generation models, revealing gaps between visual appeal and physical fidelity.
-
MultiWorld: Scalable Multi-Agent Multi-View Video World Models
MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.
-
Efficient Video Diffusion Models: Advancements and Challenges
This survey groups efficient video diffusion methods into four paradigms (step distillation, efficient attention, model compression, and cache/trajectory optimization) and outlines open challenges for practical use.
-
Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video
C-MET transfers emotions from speech to facial video by learning cross-modal semantic vectors with pretrained audio and disentangled expression encoders, yielding 14% higher emotion accuracy on MEAD and CREMA-D even for unseen emotions.
-
MoRight: Motion Control Done Right
MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply causality in video generation.
-
OmniCamera: A Unified Framework for Multi-task Video Generation with Arbitrary Camera Control
OmniCamera disentangles video content and camera motion for multi-task generation with arbitrary camera control via the OmniCAM hybrid dataset and Dual-level Curriculum Co-Training.
-
HumANDiff: Articulated Noise Diffusion for Motion-Consistent Human Video Generation
HumANDiff improves motion consistency in human video generation by sampling diffusion noise on an articulated human body template and adding joint appearance-motion prediction plus a geometric consistency loss.
-
Physics-Aware Video Instance Removal Benchmark
The PVIR benchmark tests video object removal on physical consistency using 95 annotated videos and shows that existing methods struggle with complex interactions like lingering shadows.
-
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.
-
Video Diffusion Models
A diffusion model for video generation extends image architectures with joint image-video training and improved conditional sampling, delivering the first large-scale text-to-video results and state-of-the-art performance on video prediction and unconditional generation benchmarks.
-
CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and trajectory accuracy on the NAVSIM v1 benchmark.
-
SocialDirector: Training-Free Social Interaction Control for Multi-Person Video Generation
SocialDirector uses spatiotemporal actor masking and directional reweighting on cross-attention maps to reduce actor-action mismatches and improve target-directed interactions in generated multi-person videos.
-
DiffATS: Diffusion in Aligned Tensor Space
DiffATS trains diffusion models directly on aligned Tucker tensor primitives that are proven to be homeomorphisms, delivering efficient unconditional and conditional generation across images, videos, and PDE data with high compression.
-
Implicit Preference Alignment for Human Image Animation
IPA aligns animation models for superior hand quality via implicit reward maximization on self-generated samples plus hand-focused local optimization, avoiding expensive paired data.
-
Velox: Learning Representations of 4D Geometry and Appearance
Velox compresses dynamic point clouds into latent tokens that support geometry via 4D surface modeling and appearance via 3D Gaussians, showing strong results on video-to-4D generation, tracking, and image-to-4D cloth simulation.
-
Bridging the Embodiment Gap: Disentangled Cross-Embodiment Video Editing
A dual-contrastive disentanglement method factorizes videos into independent task and embodiment latents, then uses a parameter-efficient adapter on a frozen video diffusion model to synthesize robot executions from single human demonstrations without paired data.
-
Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models
M²-REPA decouples modality-specific features inside a diffusion model and aligns each to its matching expert foundation model via an alignment loss plus a decoupling regularizer, yielding better visual quality and long-term consistency in multi-modal video generation.
-
AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
AsymTalker uses temporal reference encoding and asymmetric knowledge distillation to produce identity-consistent talking head videos up to 600 seconds long at 66 FPS.
-
HuM-Eval: A Coarse-to-Fine Framework for Human-Centric Video Evaluation
HuM-Eval evaluates human motion videos with a coarse-to-fine approach using VLM global checks plus 2D pose and 3D motion analysis, reaching 58.2% average correlation with human judgments and introducing a 1000-prompt benchmark.
-
EAD-Net: Emotion-Aware Talking Head Generation with Spatial Refinement and Temporal Coherence
EAD-Net uses a diffusion model with new spatio-temporal attention, graph-based temporal reasoning, and LLM-derived semantic descriptions to generate emotionally expressive talking head videos with improved lip-sync and coherence over prior methods.
-
Exploring the Role of Synthetic Data Augmentation in Controllable Human-Centric Video Generation
Synthetic data complements real data in diffusion-based controllable human video generation, with effective sample selection improving motion realism, temporal consistency, and identity preservation.
-
Seen-to-Scene: Keep the Seen, Generate the Unseen for Video Outpainting
Seen-to-Scene unifies propagation-based and generation-based approaches for video outpainting via fine-tuned flow completion and reference-guided latent propagation to deliver superior temporal coherence and efficiency.
-
FreqFormer: Hierarchical Frequency-Domain Attention with Adaptive Spectral Routing for Long-Sequence Video Diffusion Transformers
FreqFormer applies heterogeneous attention (dense global on low frequencies, block-sparse on mid, local on high) plus adaptive spectral routing to reduce attention cost in long-sequence video diffusion transformers.
-
Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction
Re2Pix decomposes video prediction into semantic feature forecasting followed by representation-conditioned diffusion synthesis, with nested dropout and mixed supervision to handle prediction errors.
-
Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories
A video diffusion model learns a joint distribution over videos and camera trajectories by representing cameras as pixel-aligned ray encodings (raxels) denoised jointly with video frames via decoupled attention.
-
ELT: Elastic Looped Transformers for Visual Generation
Elastic Looped Transformers share weights across recurrent blocks and apply intra-loop self-distillation to deliver 4x parameter reduction while matching competitive FID and FVD scores on ImageNet and UCF-101.
-
Lighting-grounded Video Generation with Renderer-based Agent Reasoning
LiVER conditions video diffusion models on renderer-derived 3D control signals for disentangled, editable control over object layout, lighting, and camera trajectory.
-
UNICA: A Unified Neural Framework for Controllable 3D Avatars
UNICA unifies motion planning, rigging, physical simulation, and rendering into a single skeleton-free neural framework that produces next-frame 3D avatar geometry from action inputs and renders it with Gaussian splatting.
-
CLEAR: Context-Aware Learning with End-to-End Mask-Free Inference for Adaptive Video Subtitle Removal
CLEAR achieves end-to-end mask-free video subtitle removal via dual-encoder self-supervised orthogonality and LoRA-based generation feedback, delivering +6.77 dB PSNR gains and strong zero-shot multilingual performance.
-
Unified Video Action Model
UVA learns a joint video-action latent representation with decoupled diffusion decoding heads, enabling a single model to perform accurate fast policy learning, forward/inverse dynamics, and video generation without performance loss versus task-specific methods.
-
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.
-
CameraCtrl: Enabling Camera Control for Text-to-Video Generation
CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.
-
Latte: Latent Diffusion Transformer for Video Generation
Latte achieves state-of-the-art video generation on FaceForensics, SkyTimelapse, UCF101, and Taichi-HD by using a latent diffusion transformer with four efficient spatial-temporal decomposition variants and best-practice training choices.
-
Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models
Semantic latent spaces from pretrained encoders outperform reconstruction-based spaces for robotic world models on planning and downstream policy performance.
-
Syn4D: A Multiview Synthetic 4D Dataset
Syn4D is a new multiview synthetic 4D dataset supplying dense ground-truth annotations for dynamic scene reconstruction, tracking, and human pose estimation.
-
Video Generation with Predictive Latents
PV-VAE improves video latent spaces for generation by unifying reconstruction with future-frame prediction, reporting 52% faster convergence and 34.42 FVD gain over Wan2.2 VAE on UCF101.
-
DepthPilot: From Controllability to Interpretability in Colonoscopy Video Generation
DepthPilot generates physically consistent and clinically interpretable colonoscopy videos by injecting depth priors into diffusion models through parameter-efficient fine-tuning and replacing linear denoising weights with adaptive splines.