pith. machine review for the scientific record.

Towards Accurate Generative Models of Video: A New Metric & Challenges

56 Pith papers cite this work. Polarity classification is still indexing.

abstract

Recent advances in deep generative models have led to remarkable progress in synthesizing high-quality images. Following their successful application in image processing and representation learning, an important next step is to consider videos. Learning generative models of video is a much harder task, requiring a model to capture the temporal dynamics of a scene in addition to the visual presentation of objects. While recent attempts at formulating generative models of video have had some success, current progress is hampered by (1) the lack of qualitative metrics that consider visual quality, temporal coherence, and diversity of samples, and (2) the wide gap between purely synthetic video data sets and challenging real-world data sets in terms of complexity. To this end we propose Fréchet Video Distance (FVD), a new metric for generative models of video, and StarCraft 2 Videos (SCV), a benchmark of gameplay from custom StarCraft 2 scenarios that challenge the current capabilities of generative models of video. We contribute a large-scale human study, which confirms that FVD correlates well with qualitative human judgment of generated videos, and provide initial benchmark results on SCV.
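FVD, like FID for images, fits a Gaussian to feature embeddings of real and generated samples (I3D video features in the paper) and measures the Fréchet distance between the two Gaussians. A minimal sketch of that distance computation, assuming pre-extracted feature arrays — the `feats_real`/`feats_gen` names and the toy shapes are illustrative, not from the paper:

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussians fitted to two feature sets.

    Each argument is an array of shape (n_samples, dim); FVD obtains
    these features by embedding videos with a pretrained I3D network.
    """
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_gen, rowvar=False)
    diff = mu1 - mu2
    # Matrix square root of the product of the two covariances.
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):  # discard numerical imaginary parts
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# Sanity check: identical feature sets should give (near-)zero distance,
# and a mean shift should give a large one.
rng = np.random.default_rng(0)
feats = rng.normal(size=(256, 16))
print(frechet_distance(feats, feats) < 1e-4)          # True
print(frechet_distance(feats, feats + 5.0) > 100.0)   # True
```

The formula being evaluated is d² = ||μ₁ − μ₂||² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^{1/2}); the differences between FVD and image FID lie in the feature extractor, not in this distance.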

hub tools

citation-role summary

background: 1 · dataset: 1

citation-polarity summary

claims ledger

co-cited works

verdicts

unverdicted: 56

polarities

background: 2

representative citing papers

PhysInOne: Visual Physics Learning and Reasoning in One Suite

cs.CV · 2026-04-10 · unverdicted · novelty 8.0

PhysInOne is a new dataset of 2 million videos across 153,810 dynamic 3D scenes covering 71 physical phenomena, shown to improve AI performance on physics-aware video generation, prediction, property estimation, and motion transfer.

Is Your Driving World Model an All-Around Player?

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

WorldLens benchmark reveals no driving world model dominates across visual, geometric, behavioral, and perceptual fidelity, with contributions of a 26K human-annotated dataset and a distilled vision-language evaluator.

AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics

cs.CV · 2026-05-05 · unverdicted · novelty 7.0 · 3 refs

AniMatrix generates anime videos by structuring artistic production rules into a controllable taxonomy and training the model to prioritize those rules over physical realism, achieving top scores from professional animators on prompt understanding and artistic motion.

HumanScore: Benchmarking Human Motions in Generated Videos

cs.CV · 2026-04-22 · unverdicted · novelty 7.0

HumanScore defines six metrics for kinematic plausibility, temporal stability, and biomechanical consistency to benchmark human motions in videos from thirteen state-of-the-art generation models, revealing gaps between visual appeal and physical fidelity.

Efficient Video Diffusion Models: Advancements and Challenges

cs.CV · 2026-04-17 · unverdicted · novelty 7.0

A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.

Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

C-MET transfers emotions from speech to facial video by learning cross-modal semantic vectors with pretrained audio and disentangled expression encoders, yielding 14% higher emotion accuracy on MEAD and CREMA-D even for unseen emotions.

MoRight: Motion Control Done Right

cs.CV · 2026-04-08 · unverdicted · novelty 7.0

MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply causality in video generation.

Physics-Aware Video Instance Removal Benchmark

cs.CV · 2026-04-07 · unverdicted · novelty 7.0

The PVIR benchmark tests video object removal on physical consistency using 95 annotated videos and shows that existing methods struggle with complex interactions like lingering shadows.

Video Diffusion Models

cs.CV · 2022-04-07 · unverdicted · novelty 7.0

A diffusion model for video generation extends image architectures with joint image-video training and improved conditional sampling, delivering the first large-scale text-to-video results and state-of-the-art performance on video prediction and unconditional generation benchmarks.

CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

cs.CV · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and trajectory accuracy on the NAVSIM v1 benchmark.

citing papers explorer

Showing 50 of 56 citing papers.