Recognition: 2 theorem links · Lean Theorem
Towards Accurate Generative Models of Video: A New Metric & Challenges
Pith reviewed 2026-05-11 07:16 UTC · model grok-4.3
The pith
Fréchet Video Distance scores generative video models by how closely their outputs match the statistics of real videos in a learned feature space.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose Fréchet Video Distance (FVD), a metric that fits multivariate Gaussians to the feature statistics of real and generated video sets and reports the Fréchet distance between those Gaussians. We also release the StarCraft 2 Videos (SCV) benchmark of gameplay sequences drawn from custom scenarios. A large-scale human study confirms that FVD correlates with human ratings of visual quality, temporal coherence, and diversity, while initial experiments on SCV show that existing models fall short on long-range dynamics and object interactions.
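For orientation, a hedged sketch of the closed form this computation reduces to, using the standard Fréchet (2-Wasserstein) distance between two Gaussians; the symbols μ_r, Σ_r and μ_g, Σ_g for the real and generated feature statistics are our notation, not necessarily the paper's:

```latex
% Fréchet distance between the Gaussians fitted to real (r) and generated (g)
% video features; the same closed form underlies FID.
\[
  \mathrm{FVD}
  = \lVert \mu_r - \mu_g \rVert_2^{2}
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
\]
```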
What carries the argument
Fréchet Video Distance, which extends the Fréchet Inception Distance to video by comparing Gaussian distributions over spatio-temporal features extracted from real and generated clips.
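A minimal numerical sketch of that comparison, assuming the feature matrices have already been extracted (the paper uses a pre-trained I3D network; feature extraction is omitted here). The function and variable names are illustrative and not taken from any released FVD implementation.

```python
import numpy as np
from scipy.linalg import sqrtm


def frechet_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two sets of video features.

    Both inputs have shape (num_videos, feature_dim), e.g. spatio-temporal
    embeddings from a pre-trained video network such as I3D.
    """
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)

    # Matrix square root of the covariance product; numerical error can leave
    # a tiny imaginary component, which is discarded.
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```

FID-style implementations typically add a small value to the covariance diagonals when the product is near-singular; that guard is omitted here for brevity.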
If this is right
- Training objectives or model selection can now use FVD directly instead of relying on pixel-level or frame-wise losses alone.
- Models can be compared on SCV to reveal whether they capture long-term scene dynamics that simpler datasets do not test.
- Progress on video generation can be measured automatically with a signal that aligns with human perception of coherence and realism.
- New architectures can be iterated more quickly by tracking FVD on held-out SCV clips during development.
Where Pith is reading between the lines
- FVD could serve as a training signal if differentiated through the feature extractor, allowing end-to-end optimization toward lower distances.
- The SCV benchmark pattern of using game engines for controlled yet complex scenes might transfer to other domains such as robotics or autonomous driving simulation.
- If FVD generalizes across datasets, it could reduce the need for repeated large human studies when evaluating new video models.
Load-bearing premise
Human study participants judge video quality according to the same criteria that matter for downstream applications.
What would settle it
A video generator that receives high human ratings yet produces a large FVD score, or a generator that scores well on FVD yet looks poor to viewers.
read the original abstract
Recent advances in deep generative models have lead to remarkable progress in synthesizing high quality images. Following their successful application in image processing and representation learning, an important next step is to consider videos. Learning generative models of video is a much harder task, requiring a model to capture the temporal dynamics of a scene, in addition to the visual presentation of objects. While recent attempts at formulating generative models of video have had some success, current progress is hampered by (1) the lack of qualitative metrics that consider visual quality, temporal coherence, and diversity of samples, and (2) the wide gap between purely synthetic video data sets and challenging real-world data sets in terms of complexity. To this extent we propose Fr\'{e}chet Video Distance (FVD), a new metric for generative models of video, and StarCraft 2 Videos (SCV), a benchmark of game play from custom starcraft 2 scenarios that challenge the current capabilities of generative models of video. We contribute a large-scale human study, which confirms that FVD correlates well with qualitative human judgment of generated videos, and provide initial benchmark results on SCV.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Fréchet Video Distance (FVD), an extension of the Fréchet Inception Distance (FID) to the video domain that uses features extracted from a pre-trained I3D model to measure both visual quality and temporal coherence in generated videos. It also introduces the StarCraft 2 Videos (SCV) benchmark consisting of gameplay sequences from custom StarCraft 2 scenarios designed to be more challenging than existing synthetic video datasets. A large-scale human study is presented to validate that FVD correlates with human judgments of generated video quality, along with initial benchmark results comparing several generative models on SCV.
Significance. If the reported correlation holds under scrutiny, FVD would provide a much-needed quantitative, reference-based metric for video generation that accounts for temporal dynamics, filling a gap left by image-centric metrics like FID. The SCV benchmark offers a realistic, high-complexity testbed that could drive progress beyond toy datasets. The human study adds empirical grounding, though its details are essential for adoption. This combination of metric and benchmark has the potential to become a standard evaluation protocol in video generative modeling.
major comments (2)
- [Human study] Human study section: The claim that FVD 'correlates well with qualitative human judgment' is central to validating the metric, yet the manuscript provides no details on study design, number of participants, number of videos rated, rating scale, or the statistical procedure (e.g., Pearson or Spearman correlation, p-values, confidence intervals) used to establish the correlation. Without these, the strength of the validation cannot be assessed.
- [Benchmark results] Benchmark results section: The initial benchmark results on SCV are presented without reporting variance across multiple runs, details on model training protocols, or ablation studies isolating the contribution of temporal modeling. This makes it difficult to interpret whether performance gaps are due to the metric or to implementation differences.
minor comments (2)
- [Abstract] Abstract: 'lead' should be 'led'; 'to this extent' should be 'to this end'.
- [Method] Notation: The precise mathematical definition of FVD (mean and covariance of I3D features) should be stated explicitly with an equation, even if it follows the FID formula, to avoid ambiguity.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for recognizing the potential impact of FVD and the SCV benchmark. We address each major comment below and will update the manuscript to improve clarity and completeness.
read point-by-point responses
-
Referee: [Human study] Human study section: The claim that FVD 'correlates well with qualitative human judgment' is central to validating the metric, yet the manuscript provides no details on study design, number of participants, number of videos rated, rating scale, or the statistical procedure (e.g., Pearson or Spearman correlation, p-values, confidence intervals) used to establish the correlation. Without these, the strength of the validation cannot be assessed.
Authors: We agree that the human study details are essential for readers to evaluate the strength of the correlation claim. In the revised manuscript we will expand the relevant section with a full description of the study protocol (including whether ratings were absolute or comparative), the number of participants, the number of videos evaluated, the rating scale, and the exact statistical procedure (correlation type, p-values, and confidence intervals). revision: yes
-
Referee: [Benchmark results] Benchmark results section: The initial benchmark results on SCV are presented without reporting variance across multiple runs, details on model training protocols, or ablation studies isolating the contribution of temporal modeling. This makes it difficult to interpret whether performance gaps are due to the metric or to implementation differences.
Authors: We accept that additional experimental details would aid interpretation. We will add reported variance across runs (where multiple seeds were used), expanded training-protocol descriptions for the evaluated models, and any ablations that isolate temporal components. Because the paper's primary goal is to introduce the metric and benchmark rather than to exhaustively compare models, we will also clarify this scope while supplying the requested information. revision: partial
Circularity Check
No significant circularity in FVD definition or SCV benchmark
full rationale
The paper defines FVD as the Fréchet distance between real and generated video feature distributions extracted via a pre-trained I3D network, directly extending the established FID metric without any self-referential fitting or redefinition of inputs as outputs. The SCV benchmark consists of custom StarCraft 2 gameplay scenarios presented as an external challenge set, and the human study serves as independent empirical validation of correlation rather than a load-bearing derivation step. No equations reduce by construction to fitted parameters, no uniqueness theorems are imported from self-citations, and no ansatzes are smuggled via prior author work. The central claims rest on standard metric construction plus falsifiable human judgments, so the derivation stands on its own and is checked against external benchmarks rather than against itself.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: FVD correlates with human judgment of generated video quality
Lean theorems connected to this paper
-
Foundation.LawOfExistence · defect_zero_iff_one · unclear · "FVD builds on the principles underlying Fréchet Inception Distance (FID)... We introduce a different feature representation that captures the temporal coherence of a video"
Forward citations
Cited by 60 Pith papers
-
PhysInOne: Visual Physics Learning and Reasoning in One Suite
PhysInOne is a new dataset of 2 million videos across 153,810 dynamic 3D scenes covering 71 physical phenomena, shown to improve AI performance on physics-aware video generation, prediction, property estimation, and m...
-
GaitProtector: Impersonation-Driven Gait De-Identification via Training-Free Diffusion Latent Optimization
GaitProtector optimizes diffusion model latents to impersonate target identities in gait sequences, dropping Rank-1 identification accuracy from 89.6% to 15.0% on CASIA-B while keeping scoliosis diagnostic accuracy at 74.2%.
-
$h$-control: Training-Free Camera Control via Block-Conditional Gibbs Refinement
h-control introduces block-conditional pseudo-Gibbs refinement for training-free camera control in flow-matching video generators, achieving superior FVD scores on RealEstate10K and DAVIS benchmarks.
-
Is Your Driving World Model an All-Around Player?
WorldLens benchmark reveals no driving world model dominates across visual, geometric, behavioral, and perceptual fidelity, with contributions of a 26K human-annotated dataset and a distilled vision-language evaluator.
-
ConFixGS: Learning to Fix Feedforward 3D Gaussian Splatting with Confidence-Aware Diffusion Priors in Driving Scenes
ConFixGS repairs feedforward 3D Gaussian Splatting with confidence-aware diffusion priors, delivering up to 3.68 dB PSNR gains and halved FID scores on Waymo, nuScenes, and KITTI novel view synthesis tasks.
-
One World, Dual Timeline: Decoupled Spatio-Temporal Gaussian Scene Graph for 4D Cooperative Driving Reconstruction
DUST decouples pose trajectories across camera sources in Gaussian scene graphs to resolve ghosting from temporal asynchrony, achieving better PSNR and lower FVD on V2X-Seq data.
-
One World, Dual Timeline: Decoupled Spatio-Temporal Gaussian Scene Graph for 4D Cooperative Driving Reconstruction
DUST decouples pose trajectories per camera source while sharing canonical Gaussians per agent to remove cross-source gradient conflicts and ghosting caused by temporal asynchrony in 4D cooperative driving scenes.
-
Do Joint Audio-Video Generation Models Understand Physics?
Current joint audio-video generation models lack robust physical commonsense, especially during transitions and when prompted for impossible behaviors.
-
AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics
AniMatrix generates anime videos using a production knowledge taxonomy, dual-channel conditioning, style-motion curriculum, and deformation-aware preference optimization, outperforming baselines in animator evaluation...
-
AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics
AniMatrix generates anime videos using a structured taxonomy of artistic production variables, dual-channel conditioning, a style-motion curriculum, and deformation-aware optimization to prioritize art over physics.
-
AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics
AniMatrix generates anime videos by structuring artistic production rules into a controllable taxonomy and training the model to prioritize those rules over physical realism, achieving top scores from professional ani...
-
Active Sampling for Ultra-Low-Bit-Rate Video Compression via Conditional Controlled Diffusion
ActDiff-VC achieves up to 64.6% bitrate reduction at matched NIQE and improves perceptual metrics like KID and FID by using content-adaptive keyframe selection and budget-aware sparse trajectory selection to condition...
-
AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
AsymTalker maintains identity consistency in long-term diffusion talking-head videos by encoding temporal references from a static image and training a student model under inference-like conditions via asymmetric dist...
-
ABC: Any-Subset Autoregression via Non-Markovian Diffusion Bridges in Continuous Time and Space
ABC enables any-subset autoregressive generation of continuous stochastic processes via non-Markovian diffusion bridges that track physical time and allow path-dependent conditioning.
-
Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling
Talker-T2AV achieves better lip-sync accuracy, video quality, and audio quality than dual-branch baselines by separating high-level shared autoregressive modeling from modality-specific low-level diffusion refinement ...
-
OccDirector: Language-Guided Behavior and Interaction Generation in 4D Occupancy Space
OccDirector uses a VLM-guided Spatio-Temporal MMDiT model with history anchoring to generate physically plausible 4D occupancy from language scripts, supported by the new OccInteract-85k dataset.
-
WorldMark: A Unified Benchmark Suite for Interactive Video World Models
WorldMark is the first public benchmark that standardizes scenes, trajectories, and control interfaces across heterogeneous interactive image-to-video world models.
-
HumanScore: Benchmarking Human Motions in Generated Videos
HumanScore defines six metrics for kinematic plausibility, temporal stability, and biomechanical consistency to benchmark human motions in videos from thirteen state-of-the-art generation models, revealing gaps betwee...
-
MultiWorld: Scalable Multi-Agent Multi-View Video World Models
MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.
-
Efficient Video Diffusion Models: Advancements and Challenges
A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.
-
Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video
C-MET transfers emotions from speech to facial video by learning cross-modal semantic vectors with pretrained audio and disentangled expression encoders, yielding 14% higher emotion accuracy on MEAD and CREMA-D even f...
-
MoRight: Motion Control Done Right
MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply ...
-
OmniCamera: A Unified Framework for Multi-task Video Generation with Arbitrary Camera Control
OmniCamera disentangles video content and camera motion for multi-task generation with arbitrary camera control via the OmniCAM hybrid dataset and Dual-level Curriculum Co-Training.
-
HumANDiff: Articulated Noise Diffusion for Motion-Consistent Human Video Generation
HumANDiff improves motion consistency in human video generation by sampling diffusion noise on an articulated human body template and adding joint appearance-motion prediction plus a geometric consistency loss.
-
Physics-Aware Video Instance Removal Benchmark
The PVIR benchmark tests video object removal on physical consistency using 95 annotated videos and shows that existing methods struggle with complex interactions like lingering shadows.
-
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.
-
Video Diffusion Models
A diffusion model for video generation extends image architectures with joint image-video training and improved conditional sampling, delivering first large-scale text-to-video results and state-of-the-art performance...
-
CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.
-
SocialDirector: Training-Free Social Interaction Control for Multi-Person Video Generation
SocialDirector uses spatiotemporal actor masking and directional reweighting on cross-attention maps to reduce actor-action mismatches and improve target-directed interactions in generated multi-person videos.
-
DiffATS: Diffusion in Aligned Tensor Space
DiffATS trains diffusion models directly on aligned Tucker tensor primitives that are proven to be homeomorphisms, delivering efficient unconditional and conditional generation across images, videos, and PDE data with...
-
Implicit Preference Alignment for Human Image Animation
IPA aligns animation models for superior hand quality via implicit reward maximization on self-generated samples plus hand-focused local optimization, avoiding expensive paired data.
-
Velox: Learning Representations of 4D Geometry and Appearance
Velox compresses dynamic point clouds into latent tokens that support geometry via 4D surface modeling and appearance via 3D Gaussians, showing strong results on video-to-4D generation, tracking, and image-to-4D cloth...
-
Bridging the Embodiment Gap: Disentangled Cross-Embodiment Video Editing
A dual-contrastive disentanglement method factorizes videos into independent task and embodiment latents, then uses a parameter-efficient adapter on a frozen video diffusion model to synthesize robot executions from s...
-
Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models
M²-REPA decouples modality-specific features inside a diffusion model and aligns each to its matching expert foundation model via an alignment loss plus a decoupling regularizer, yielding better visual quality and lon...
-
AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
AsymK-Talker introduces kernel-conditioned loop generation, temporal reference encoding, and asymmetric kernel distillation to achieve real-time, drift-resistant talking head synthesis from audio using diffusion models.
-
AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
AsymTalker uses temporal reference encoding and asymmetric knowledge distillation to produce identity-consistent talking head videos up to 600 seconds long at 66 FPS.
-
HuM-Eval: A Coarse-to-Fine Framework for Human-Centric Video Evaluation
HuM-Eval evaluates human motion videos with a coarse-to-fine approach using VLM global checks plus 2D pose and 3D motion analysis, reaching 58.2% average correlation with human judgments and introducing a 1000-prompt ...
-
EAD-Net: Emotion-Aware Talking Head Generation with Spatial Refinement and Temporal Coherence
EAD-Net uses a diffusion model with new spatio-temporal attention, graph-based temporal reasoning, and LLM-derived semantic descriptions to generate emotionally expressive talking head videos with improved lip-sync an...
-
Exploring the Role of Synthetic Data Augmentation in Controllable Human-Centric Video Generation
Synthetic data complements real data in diffusion-based controllable human video generation, with effective sample selection improving motion realism, temporal consistency, and identity preservation.
-
Seen-to-Scene: Keep the Seen, Generate the Unseen for Video Outpainting
Seen-to-Scene unifies propagation-based and generation-based approaches for video outpainting via fine-tuned flow completion and reference-guided latent propagation to deliver superior temporal coherence and efficiency.
-
FreqFormer: Hierarchical Frequency-Domain Attention with Adaptive Spectral Routing for Long-Sequence Video Diffusion Transformers
FreqFormer applies heterogeneous attention (dense global on low frequencies, block-sparse on mid, local on high) plus adaptive spectral routing to reduce attention cost in long-sequence video diffusion transformers.
-
Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction
Re2Pix decomposes video prediction into semantic feature forecasting followed by representation-conditioned diffusion synthesis, with nested dropout and mixed supervision to handle prediction errors.
-
Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories
A video diffusion model learns a joint distribution over videos and camera trajectories by representing cameras as pixel-aligned ray encodings (raxels) denoised jointly with video frames via decoupled attention.
-
ELT: Elastic Looped Transformers for Visual Generation
Elastic Looped Transformers share weights across recurrent blocks and apply intra-loop self-distillation to deliver 4x parameter reduction while matching competitive FID and FVD scores on ImageNet and UCF-101.
-
Lighting-grounded Video Generation with Renderer-based Agent Reasoning
LiVER conditions video diffusion models on renderer-derived 3D control signals for disentangled, editable control over object layout, lighting, and camera trajectory.
-
UNICA: A Unified Neural Framework for Controllable 3D Avatars
UNICA unifies motion planning, rigging, physical simulation, and rendering into a single skeleton-free neural framework that produces next-frame 3D avatar geometry from action inputs and renders it with Gaussian splatting.
-
Unified Video Action Model
UVA learns a joint video-action latent representation with decoupled diffusion decoding heads, enabling a single model to perform accurate fast policy learning, forward/inverse dynamics, and video generation without p...
-
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.
-
CameraCtrl: Enabling Camera Control for Text-to-Video Generation
CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.
-
Latte: Latent Diffusion Transformer for Video Generation
Latte achieves state-of-the-art video generation on FaceForensics, SkyTimelapse, UCF101, and Taichi-HD by using a latent diffusion transformer with four efficient spatial-temporal decomposition variants and best-pract...
-
Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models
Semantic latent spaces from pretrained encoders outperform reconstruction-based spaces for robotic world models on planning and downstream policy performance.
-
Syn4D: A Multiview Synthetic 4D Dataset
Syn4D is a new multiview synthetic 4D dataset supplying dense ground-truth annotations for dynamic scene reconstruction, tracking, and human pose estimation.
-
Video Generation with Predictive Latents
PV-VAE improves video latent spaces for generation by unifying reconstruction with future-frame prediction, reporting 52% faster convergence and 34.42 FVD gain over Wan2.2 VAE on UCF101.
-
DepthPilot: From Controllability to Interpretability in Colonoscopy Video Generation
DepthPilot generates physically consistent and clinically interpretable colonoscopy videos by injecting depth priors into diffusion models through parameter-efficient fine-tuning and replacing linear denoising weights...
-
AutoAWG: Adverse Weather Generation with Adaptive Multi-Controls for Automotive Videos
AutoAWG generates controllable adverse weather automotive videos via semantics-guided adaptive multi-control fusion and vanishing-point-anchored temporal synthesis from static images, reducing FID by 50% and FVD by 16...
-
Prompt-to-Gesture: Measuring the Capabilities of Image-to-Video Deictic Gesture Generation
Prompt-driven image-to-video generation produces deictic gestures that match real data visually, add useful variety, and improve downstream recognition models when mixed with human recordings.
-
ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation
Compositional Simulation generates scalable real-world robot training data by combining classical simulation with neural simulation in a closed-loop real-sim-real augmentation pipeline.
-
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
CogVideo is a large-scale transformer pretrained for text-to-video generation that outperforms public models in evaluations.
-
World Action Models: The Next Frontier in Embodied AI
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
-
ModelScope Text-to-Video Technical Report
ModelScopeT2V is a 1.7-billion-parameter text-to-video model built on Stable Diffusion that adds temporal modeling and outperforms prior methods on three evaluation metrics.
Reference graph
Works this paper leans on
- [1] M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and S. Levine. Stochastic variational video prediction. International Conference on Learning Representations (ICLR), 2018.
- [2] M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton. Demystifying MMD GANs. International Conference on Learning Representations (ICLR), 2018.
- [3]
- [4] W. Byeon, Q. Wang, R. K. Srivastava, P. Koumoutsakos, P. Vlachas, Z. Wan, T. Sapsis, F. Raue, S. Palacio, T. Breuel, et al. ContextVP: Fully context-aware video prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1122–1126, 2018.
- [5] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- [6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
- [7] E. Denton and R. Fergus. Stochastic video generation with a learned prior. International Conference on Machine Learning (ICML), 2018.
- [8] D. Dowson and B. Landau. The Fréchet distance between multivariate normal distributions. Journal of Multivariate Analysis, 12(3):450–455, 1982.
- [9]
- [10] C. Finn, I. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. Advances in Neural Information Processing Systems (NIPS), 2016.
- [11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems (NIPS), 2014.
- [12] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
- [13] E. Haller and M. Leordeanu. Unsupervised object segmentation in video by efficient selection of highly probable positive features. IEEE International Conference on Computer Vision (ICCV), 2017.
- [14]
- [15] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
- [16] Q. Huynh-Thu and M. Ghanbari. The accuracy of PSNR in predicting video quality for different video scenes and frame rates. Telecommunication Systems, 2012.
- [17] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- [18]
- [19] N. Kalchbrenner, A. van den Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu. Video pixel networks. International Conference on Machine Learning (ICML), 2017.
- [20]
- [21] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman. The Kinetics human action video dataset. arXiv, 2017.
- [22]
- [23] B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.
- [24] A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and S. Levine. Stochastic adversarial video prediction. arXiv, 2018.
- [25]
- [26]
- [27] P. Luc, C. Couprie, S. Chintala, and J. Verbeek. Semantic segmentation using adversarial networks. arXiv, 2016.
- [28]
- [29] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. International Conference on Learning Representations (ICLR), 2016.
- [30] N. Ponomarenko, L. Jin, O. Ieremeiev, V. Lukin, K. Egiazarian, J. Astola, B. Vozel, K. Chehdi, M. Carli, F. Battisti, et al. Image database TID2013: Peculiarities, results and perspectives. Signal Processing: Image Communication, 30:57–77, 2015.
- [31]
- [32] I. Radosavovic, P. Dollár, R. Girshick, G. Gkioxari, and K. He. Data distillation: Towards omni-supervised learning. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- [33] M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra. Video (language) modeling: a baseline for generative models of natural videos. arXiv, 2014.
- [34]
- [35] M. S. Sajjadi, B. Schölkopf, and M. Hirsch. EnhanceNet: Single image super-resolution through automated texture synthesis. International Conference on Computer Vision (ICCV), 2017.
- [36] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: a local SVM approach. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR), volume 3, pages 32–36. IEEE, 2004.
- [37]
- [38] N. Srivastava, E. Mansimov, and R. Salakhudinov. Unsupervised learning of video representations using LSTMs. International Conference on Machine Learning (ICML), 2015.
- [39] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- [40]
- [41] S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz. MoCoGAN: Decomposing motion and content for video generation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- [42] T. Unterthiner, B. Nessler, C. Seward, G. Klambauer, M. Heusel, H. Ramsauer, and S. Hochreiter. Coulomb GANs: Provably optimal Nash equilibria via potential fields. International Conference on Learning Representations (ICLR), 2018.
- [43] A. Vaswani, S. Bengio, E. Brevdo, F. Chollet, A. N. Gomez, S. Gouws, L. Jones, L. Kaiser, N. Kalchbrenner, N. Parmar, R. Sepassi, N. Shazeer, and J. Uszkoreit. Tensor2Tensor for neural machine translation. arXiv, 2018.
- [44] R. Villegas, D. Erhan, H. Lee, et al. Hierarchical long-term video prediction without supervision. In International Conference on Machine Learning, pages 6033–6041, 2018.
- [45] O. Vinyals, T. Ewalds, S. Bartunov, P. Georgiev, A. S. Vezhnevets, M. Yeo, A. Makhzani, H. Küttler, J. Agapiou, J. Schrittwieser, et al. StarCraft II: A new challenge for reinforcement learning. arXiv, 2017.
- [46] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS), 2016.
- [47] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, N. Yakovenko, A. Tao, J. Kautz, and B. Catanzaro. Video-to-video synthesis. In Advances in Neural Information Processing Systems, pages 1152–1164, 2018.
- [48] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 2004.
- [49] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.