Recognition: 2 theorem links · Lean Theorem
OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation
Pith reviewed 2026-05-14 20:30 UTC · model grok-4.3
The pith
OpenVid-1M supplies over a million precise text-video pairs with expressive captions to improve text-to-video generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce OpenVid-1M, a precise high-quality dataset with expressive captions. This open-scenario dataset contains over 1 million text-video pairs, facilitating research on T2V generation. Furthermore, we curate 433K 1080p videos from OpenVid-1M to create OpenVidHD-0.4M, advancing high-definition video generation. Additionally, we propose a novel Multi-modal Video Diffusion Transformer (MVDiT) capable of mining both structure information from visual tokens and semantic information from text tokens.
What carries the argument
The Multi-modal Video Diffusion Transformer (MVDiT), which extracts structure from visual tokens and semantics from text tokens through joint multi-modal processing.
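To make the joint-processing idea concrete, here is a minimal PyTorch sketch of attention over concatenated visual and text tokens. It is not the paper's actual MVDiT block: the class name, dimensions, and layer layout are illustrative assumptions, and diffusion-specific conditioning (timesteps, noise schedule) is omitted.

```python
import torch
import torch.nn as nn

class JointMultiModalBlock(nn.Module):
    """Illustrative joint visual-text attention block (not the paper's exact MVDiT layer)."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, visual_tokens: torch.Tensor, text_tokens: torch.Tensor):
        # visual_tokens: (B, N_v, D) flattened spatio-temporal patches (structure)
        # text_tokens:   (B, N_t, D) encoded caption tokens (semantics)
        x = torch.cat([visual_tokens, text_tokens], dim=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)  # joint self-attention: each stream reads from the other
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        n_v = visual_tokens.shape[1]
        return x[:, :n_v], x[:, n_v:]     # split back into visual and text streams

# Example: video = torch.randn(2, 256, 512); text = torch.randn(2, 77, 512)
# v_out, t_out = JointMultiModalBlock()(video, text)
```

The contrast with simple cross-attention (video queries attending to frozen text keys) is that here the text tokens are also updated by the visual stream, which is the kind of two-way exchange the review credits with better semantic fidelity.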
If this is right
- OpenVid-1M enables smaller research groups to train competitive text-to-video models without relying on proprietary or oversized collections.
- OpenVidHD-0.4M directly supports experiments in 1080p video generation.
- MVDiT improves semantic fidelity in generated videos by moving beyond simple cross-attention for text prompts.
- Ablation studies in the paper link the joint visual-text processing in MVDiT to measurable performance lifts.
Where Pith is reading between the lines
- The curation approach could be adapted to create similarly precise datasets for other video tasks such as action recognition or video captioning.
- Wider adoption of OpenVid-1M might standardize evaluation protocols across text-to-video papers.
- Extensions of MVDiT to longer sequences or additional conditioning signals remain open for follow-up work.
Load-bearing premise
The newly collected videos and captions in OpenVid-1M are verifiably higher quality and more precise than those in prior datasets such as WebVid-10M and Panda-70M.
What would settle it
A controlled training run showing no measurable gains in video quality or text alignment metrics when models use OpenVid-1M instead of WebVid-10M would falsify the dataset superiority claim.
read the original abstract
Text-to-video (T2V) generation has recently garnered significant attention thanks to the large multi-modality model Sora. However, T2V generation still faces two important challenges: 1) Lacking a precise open sourced high-quality dataset. The previous popular video datasets, e.g. WebVid-10M and Panda-70M, are either with low quality or too large for most research institutions. Therefore, it is challenging but crucial to collect a precise high-quality text-video pairs for T2V generation. 2) Ignoring to fully utilize textual information. Recent T2V methods have focused on vision transformers, using a simple cross attention module for video generation, which falls short of thoroughly extracting semantic information from text prompt. To address these issues, we introduce OpenVid-1M, a precise high-quality dataset with expressive captions. This open-scenario dataset contains over 1 million text-video pairs, facilitating research on T2V generation. Furthermore, we curate 433K 1080p videos from OpenVid-1M to create OpenVidHD-0.4M, advancing high-definition video generation. Additionally, we propose a novel Multi-modal Video Diffusion Transformer (MVDiT) capable of mining both structure information from visual tokens and semantic information from text tokens. Extensive experiments and ablation studies verify the superiority of OpenVid-1M over previous datasets and the effectiveness of our MVDiT.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OpenVid-1M, an open dataset of over 1 million text-video pairs with expressive captions for text-to-video (T2V) generation, along with a 433K 1080p high-definition subset (OpenVidHD-0.4M). It also proposes the Multi-modal Video Diffusion Transformer (MVDiT) that jointly processes visual tokens for structure and text tokens for semantics. The central claims are that OpenVid-1M is superior in quality and precision to prior datasets such as WebVid-10M and Panda-70M, and that MVDiT is more effective than existing T2V methods, as verified by experiments and ablations.
Significance. If the dataset curation claims can be supported by objective, reproducible quality metrics independent of downstream performance and if the experimental comparisons properly isolate the contributions of data and architecture, this would supply a much-needed large-scale open resource for T2V research and a practical architecture improvement for better text conditioning in diffusion transformers.
major comments (2)
- [§3] The curation pipeline (source selection, filtering, and captioning) is described in detail, yet no quantitative validation is provided—no automated caption fidelity metrics (CIDEr, SPICE, or similar against human references), no standardized video quality scores (BRISQUE, NIQE, motion coherence), and no controlled human preference study with statistical reporting. This leaves the repeated claim of 'precise high-quality' and 'expressive captions' superior to WebVid-10M and Panda-70M without independent evidence.
- [§4] The reported experiments and ablations compare MVDiT trained on OpenVid-1M against baselines, but do not fix model architecture, optimizer, learning-rate schedule, and compute budget while swapping only the training dataset. Consequently, observed gains cannot be attributed specifically to dataset quality rather than differences in scale, diversity, or training procedure.
minor comments (2)
- [Abstract] The superiority claim is stated without any numerical metrics or baseline names, which reduces immediate readability; adding one or two key quantitative results would strengthen the summary.
- [§1] A compact comparison table of prior datasets (size, resolution, caption quality indicators, public availability) would help readers quickly situate OpenVid-1M relative to WebVid-10M and Panda-70M.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and indicate planned revisions to improve the manuscript.
read point-by-point responses
-
Referee: [§3] The curation pipeline (source selection, filtering, and captioning) is described in detail, yet no quantitative validation is provided—no automated caption fidelity metrics (CIDEr, SPICE, or similar against human references), no standardized video quality scores (BRISQUE, NIQE, motion coherence), and no controlled human preference study with statistical reporting. This leaves the repeated claim of 'precise high-quality' and 'expressive captions' superior to WebVid-10M and Panda-70M without independent evidence.
Authors: We agree that independent quantitative validation would strengthen the quality claims. The manuscript primarily demonstrates superiority via downstream T2V performance gains. In revision we will add no-reference metrics (NIQE, BRISQUE) computed on sampled frames from OpenVid-1M versus WebVid-10M and Panda-70M, plus CLIP text-video alignment scores as a proxy for caption fidelity. A full human preference study with statistical reporting is not feasible at this scale without new resources, so we will instead expand qualitative examples and report caption statistics (length, vocabulary diversity). revision: partial
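As an editorial sketch of what the proposed CLIP text-video alignment proxy could look like (not the authors' evaluation code; the model choice, frame-sampling policy, and the clip_alignment_score helper are assumptions):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed setup: `frames` are PIL images uniformly sampled from one video clip.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_alignment_score(frames: list[Image.Image], caption: str) -> float:
    """Average cosine similarity between the caption and sampled frames (higher = better aligned)."""
    inputs = processor(text=[caption], images=frames, return_tensors="pt",
                       padding=True, truncation=True)
    out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()

# Comparing the score distribution over OpenVid-1M caption-clip pairs against
# WebVid-10M pairs would give the caption-fidelity proxy mentioned above.
```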
-
Referee: [§4] The reported experiments and ablations compare MVDiT trained on OpenVid-1M against baselines, but do not fix model architecture, optimizer, learning-rate schedule, and compute budget while swapping only the training dataset. Consequently, observed gains cannot be attributed specifically to dataset quality rather than differences in scale, diversity, or training procedure.
Authors: We thank the referee for highlighting this isolation issue. Our current ablations vary data scale within OpenVid but do not fully control hyperparameters across all external baselines. In the revised manuscript we will add a controlled experiment that trains the identical MVDiT architecture on OpenVid-1M and a size-matched WebVid subset using the same optimizer, learning-rate schedule, and compute budget, allowing direct attribution of gains to dataset differences. revision: yes
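A dataset-swap control of the kind promised here could be pinned down with a frozen run configuration in which only the dataset identifier varies. The field names and values below are illustrative assumptions, not the authors' actual training recipe.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ControlledRun:
    """Everything except `dataset` is held fixed across the two runs."""
    dataset: str                      # "openvid-1m" or "webvid-size-matched"
    architecture: str = "mvdit"       # identical model in both runs
    optimizer: str = "adamw"
    learning_rate: float = 1e-4
    lr_schedule: str = "cosine"
    batch_size: int = 256
    train_steps: int = 100_000        # same compute budget for both runs
    seed: int = 0

runs = [ControlledRun(dataset="openvid-1m"),
        ControlledRun(dataset="webvid-size-matched")]
# With architecture, optimizer, schedule, and step budget pinned, any gap in
# video-quality or text-alignment metrics is attributable to the training data.
```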
Circularity Check
No significant circularity; the contributions are a newly curated dataset and a proposed architecture.
full rationale
The paper introduces OpenVid-1M via a curation pipeline and proposes MVDiT as a new architecture, then reports experimental comparisons. No load-bearing steps reduce by construction to fitted parameters, self-citations, or self-definitions. Claims of dataset superiority rest on described filtering and captioning processes rather than any equation or prediction that loops back to its own inputs. Experiments in section 4 compare models but do not invoke uniqueness theorems or ansatzes from prior self-work as forcing functions. This is a standard empirical contribution with independent content.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: High-quality text-video pairs can be reliably collected and captioned at scale from open sources.
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.LawOfExistence.defect_zero_iff_one
unclear: relation between the paper passage and the cited Recognition theorem.
we introduce OpenVid-1M, a precise high-quality dataset with expressive captions... propose a novel Multi-modal Video Diffusion Transformer (MVDiT) capable of mining both structure information from visual tokens and semantic information from text tokens
-
IndisputableMonolith.Foundation.DimensionForcing.dimension_forced
unclear: relation between the paper passage and the cited Recognition theorem.
Our OpenVid-1M has several characteristics: 1) Superior in quantity... 2) Superior in visual quality... 3) Expressive in caption
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 24 Pith papers
-
GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion
GTA generates 3D worlds from single images via a two-stage video diffusion process that prioritizes geometry before appearance to improve structural consistency.
-
MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics
MoCam unifies static and dynamic novel view synthesis by temporally decoupling geometric alignment and appearance refinement within the diffusion denoising process.
-
MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics
MoCam uses structured denoising dynamics in diffusion models to temporally decouple geometric alignment from appearance refinement, enabling unified novel view synthesis that outperforms prior methods on imperfect poi...
-
OmniShotCut: Holistic Relational Shot Boundary Detection with Shot-Query Transformer
OmniShotCut treats shot boundary detection as structured relational prediction via a shot-query Transformer, uses fully synthetic transitions for training data, and releases OmniShotCutBench for evaluation.
-
MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation
MuSS is a new movie-sourced dataset and benchmark that enables AI models to generate multi-shot videos with improved narrative coherence and subject identity preservation.
-
MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation
MuSS is a movie-derived dataset and benchmark that enables AI models to generate multi-shot videos with coherent narratives and preserved subject identity across shots.
-
FlowC2S: Flowing from Current to Succeeding Frames for Fast and Memory-Efficient Video Continuation
FlowC2S generates video continuations by flowing directly from current to next frames in a fine-tuned flow model, using adjacent chunks as optimal couplings and target inversion to cut input size in half and beat SOTA...
-
Efficient Video Diffusion Models: Advancements and Challenges
A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.
-
DreamStereo: Towards Real-Time Stereo Inpainting for HD Videos
DreamStereo uses GAPW, PBDP, and SASI to enable real-time stereo video inpainting at 25 FPS for HD videos by reducing over 70% redundant computation while maintaining quality.
-
MotionScape: A Large-Scale Real-World Highly Dynamic UAV Video Dataset for World Models
MotionScape is a large-scale UAV video dataset with highly dynamic 6-DoF motions, geometric trajectories, and semantic annotations to train world models that better simulate complex 3D dynamics under large viewpoint changes.
-
SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation
SwiftI2V matches end-to-end 2K I2V quality on VBench while cutting GPU time by 202x via conditional segment-wise generation that bounds token cost and preserves input fidelity.
-
SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation
SwiftI2V achieves comparable 2K I2V quality to end-to-end models on VBench-I2V while cutting GPU time by 202x through low-resolution motion planning followed by strongly image-conditioned segment-wise high-resolution ...
-
Seeing Fast and Slow: Learning the Flow of Time in Videos
Self-supervised models learn to perceive and manipulate the flow of time in videos, supporting speed detection, large-scale slow-motion data curation, and temporally controllable video synthesis.
-
VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects
VEFX-Bench releases a large human-labeled video editing dataset, a multi-dimensional reward model, and a standardized benchmark that better matches human judgments than generic evaluators.
-
POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch
POINTS-Seeker-8B is an 8B multimodal model trained from scratch for agentic search that uses seeding and visual-space history folding to outperform prior models on six visual reasoning benchmarks.
-
VibeFlow: Versatile Video Chroma-Lux Editing through Self-Supervised Learning
VibeFlow performs versatile video chroma-lux editing in zero-shot fashion by self-supervised disentanglement of structure and color-illumination cues inside pre-trained video models, plus residual velocity fields and ...
-
POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.
-
Latent-Compressed Variational Autoencoder for Video Diffusion Models
A frequency-based latent compression method for video VAEs yields higher reconstruction quality than channel-reduction baselines at fixed compression ratios.
-
SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations
SceneScribe-1M is a new dataset of 1 million videos with semantic text, camera parameters, dense depth, and consistent 3D point tracks to support monocular depth estimation, scene reconstruction, point tracking, and t...
-
PhyEdit: Towards Real-World Object Manipulation via Physically-Grounded Image Editing
PhyEdit improves physical accuracy in image object manipulation by using explicit geometric simulation as 3D-aware guidance combined with joint 2D-3D supervision.
-
INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling
INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via STAR architecture with implicit caching, explicit geometric constraints, and distribution-matching...
-
From Ideal to Real: Stable Video Object Removal under Imperfect Conditions
SVOR achieves stable, shadow-free video object removal under real-world imperfections via MUSE mask handling, DA-Seg localization, and curriculum training on real and synthetic data.
-
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
-
Show-o2: Improved Native Unified Multimodal Models
Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
Reference graph
Works this paper leans on
-
[1]
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127,
-
[2]
VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, et al. VideoCrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512, 2023a.
-
[3]
Adam: A Method for Stochastic Optimization
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980,
-
[4]
EvalCrafter: Benchmarking and Evaluating Large Video Generation Models
Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. Evalcrafter: Benchmarking and evaluating large video generation models. arXiv preprint arXiv:2310.11440, 2023b.
-
[5]
Latte: Latent Diffusion Transformer for Video Generation
Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048, 2024a.
-
[6]
UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402,
-
[7]
ModelScope Text-to-Video Technical Report
Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023a.
-
[8]
Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks
Wei Xiong, Wenhan Luo, Lin Ma, Wei Liu, and Jiebo Luo. Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2364–2373,
-
[9]
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072,
-
[10]
Celebv-text: A large-scale facial text-video dataset
Jianhui Yu, Hao Zhu, Liming Jiang, Chen Change Loy, Weidong Cai, and Wayne Wu. Celebv-text: A large-scale facial text-video dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14805–14814, 2023a.
-
[11]
Make pixels dance: High-dynamic video generation
Yan Zeng, Guoqiang Wei, Jiani Zheng, Jiaxin Zou, Yang Wei, Yuchen Zhang, and Hang Li. Make pixels dance: High-dynamic video generation. arXiv preprint arXiv:2311.10982,
-
[12]
Show-1: Marrying pixel and latent diffusion models for text-to-video generation
David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, and Mike Zheng Shou. Show-1: Marrying pixel and latent diffusion models for text-to-video generation. arXiv preprint arXiv:2309.15818,
-
[13]
We adopted the LAION Aesthetic Predictor and DOVER (Wu et al.,
Appendix A (More Implementation Details), A.1 Data Processing Pipeline, Aesthetics and Clarity Assessment: We adopted the LAION Aesthetic Predictor and DOVER (Wu et al.,
-
[14]
between every two adjacent frames in the video and take the average as an indicator of the temporal consistency, measuring the coherence and consistency of the video frames. Motion Difference. To measure motion amplitude, we utilize UniMatch (Xu et al., 2023), a pretrained state-of-the-art optical flow estimation method that is both efficient and accurate...
-
[15]
a dog wearing vr goggles on a boat
Combining all four steps yields the highest scores in most metrics. Figure 8: Visualizations of the videos with varying (a) clarity, (b) aesthetic, (c) motion, and (d) temporal consistency scores. Table 9: Ablation studies...
-
[16]
Specifically, OpenVid-1M consists of 1,019,957 clips, averaging 7.2 seconds each, with a total video length of 2,051 hours. Compared to previous million-level datasets, WebVid-10M contains low-quality videos with watermarks and Panda-70M contains many still, flickering, or blurry videos along with short captions. In contrast, our OpenVid-1M contains high...
discussion (0)