pith. machine review for the scientific record. sign in

arxiv: 2205.15868 · v1 · submitted 2022-05-29 · 💻 cs.CV · cs.CL· cs.LG

Recognition: 2 theorem links

· Lean Theorem

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

Jie Tang, Ming Ding, Wendi Zheng, Wenyi Hong, Xinghan Liu

Authors on Pith no claims yet

Pith reviewed 2026-05-11 12:17 UTC · model grok-4.3

classification 💻 cs.CV cs.CLcs.LG
keywords text-to-video generationtransformer modelpretrainingvideo synthesishierarchical trainingimage-to-video transferCogVideo
0
0 comments X

The pith

CogVideo generates videos from text by inheriting weights from a text-to-image model and applying multi-frame-rate hierarchical training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CogVideo, a 9-billion-parameter transformer for text-to-video generation. It tackles the prohibitive cost of training video models from scratch and the shortage of well-aligned text-video data by starting from the CogView2 image model and adding staged training that aligns descriptions with clips at varying speeds. A reader would care because the resulting open-source system produces more coherent motion and semantics than other public models, as measured by both automated metrics and human judgments. This shows a workable path to scale video synthesis without building every capability from zero data.

Core claim

Large-scale pretrained transformers have created milestones in text and text-to-image generation, yet video generation faces huge computation costs and scarce relevant datasets. We present the 9B-parameter transformer CogVideo, trained by inheriting a pretrained text-to-image model CogView2. We also propose a multi-frame-rate hierarchical training strategy to better align text and video clips. As the first open-source large-scale pretrained text-to-video model, CogVideo outperforms all publicly available models at a large margin in machine and human evaluations.

What carries the argument

Weight inheritance from the CogView2 text-to-image model plus multi-frame-rate hierarchical training, which transfers static image understanding to dynamic video while aligning text semantics across frame rates.

If this is right

  • Generated videos exhibit stronger alignment between text descriptions and complex movements.
  • An open-source model at this scale becomes available for further research and applications.
  • Video generation can be scaled without full from-scratch training on massive video corpora.
  • The approach demonstrates transfer of capabilities from image to video domains via staged alignment training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same inheritance-plus-hierarchy pattern might support longer or higher-resolution videos if compute budgets increase.
  • Fine-tuning on domain-specific video sets could adapt the model for tasks such as animation or simulation.
  • Combining the output with audio or 3D models could extend the system toward richer multimedia generation.

Load-bearing premise

That inheriting weights from a text-to-image model plus multi-frame-rate hierarchical training is enough to overcome scarce text-video data and the high cost of training video models from scratch.

What would settle it

Blind human preference tests or standard video quality metrics such as FVD in which CogVideo does not show a clear margin over other publicly released text-to-video systems.

read the original abstract

Large-scale pretrained transformers have created milestones in text (GPT-3) and text-to-image (DALL-E and CogView) generation. Its application to video generation is still facing many challenges: The potential huge computation cost makes the training from scratch unaffordable; The scarcity and weak relevance of text-video datasets hinder the model understanding complex movement semantics. In this work, we present 9B-parameter transformer CogVideo, trained by inheriting a pretrained text-to-image model, CogView2. We also propose multi-frame-rate hierarchical training strategy to better align text and video clips. As (probably) the first open-source large-scale pretrained text-to-video model, CogVideo outperforms all publicly available models at a large margin in machine and human evaluations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces CogVideo, a 9B-parameter transformer for text-to-video generation. It inherits weights from the pretrained CogView2 text-to-image model to reduce compute costs and applies a multi-frame-rate hierarchical training strategy to improve text-video alignment despite limited relevant data. The authors claim CogVideo is likely the first open-source large-scale pretrained text-to-video model and outperforms all publicly available models by a large margin in both machine and human evaluations.

Significance. If the performance claims hold under rigorous controls, this would be a meaningful early contribution to text-to-video generation by showing how weight inheritance from image models and hierarchical training can scale to 9B parameters. The open release of the model is a clear strength that could enable follow-on work, analogous to the role of early large text and image models. However, the significance is reduced because the central empirical claim depends on unshown evidence that the proposed techniques, rather than model scale or dataset choices alone, drive the gains.

major comments (3)
  1. [§4 (Experiments)] §4 (Experiments): No ablation is presented that isolates the effect of inheriting weights from CogView2 versus random initialization at 9B scale. This is load-bearing for the introduction's claim that inheritance overcomes text-video data scarcity; without it, observed gains could be explained by capacity or data alone.
  2. [§4.1 (Evaluation protocol)] §4.1 (Evaluation protocol): The multi-frame-rate hierarchical training is not compared against a single-rate baseline in controlled experiments. This weakens the assertion that the hierarchical schedule is responsible for improved text-video alignment, as required to support the 'large margin' superiority claim.
  3. [§4 (Experiments)] §4 (Experiments): The manuscript supplies no quantitative metrics (e.g., specific FID, CLIP-score, or human preference percentages), named baselines, or dataset statistics to substantiate the 'outperforms all publicly available models at a large margin' statement. These details are necessary to evaluate the central empirical claim.
minor comments (2)
  1. [Abstract] The abstract would be improved by including at least one concrete quantitative result to support the performance claims.
  2. Figure captions should be expanded to be self-contained, especially for any qualitative generation examples.

Simulated Author's Rebuttal

3 responses · 2 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing honest clarifications on our design choices and empirical claims while noting where revisions are feasible.

read point-by-point responses
  1. Referee: §4 (Experiments): No ablation is presented that isolates the effect of inheriting weights from CogView2 versus random initialization at 9B scale. This is load-bearing for the introduction's claim that inheritance overcomes text-video data scarcity; without it, observed gains could be explained by capacity or data alone.

    Authors: We agree that a controlled ablation isolating weight inheritance at the full 9B scale would strengthen the claim regarding data scarcity. However, training a 9B-parameter model from random initialization requires prohibitive compute (estimated >10,000 GPU-hours per run), which exceeded our resources. Our approach follows established transfer-learning practices from image to video models, with performance gains shown via overall machine and human evaluations. In revision, we will expand Section 4 with additional discussion of this limitation and any supporting evidence from smaller-scale pretraining experiments. revision: partial

  2. Referee: §4.1 (Evaluation protocol): The multi-frame-rate hierarchical training is not compared against a single-rate baseline in controlled experiments. This weakens the assertion that the hierarchical schedule is responsible for improved text-video alignment, as required to support the 'large margin' superiority claim.

    Authors: We acknowledge that a direct single-rate baseline comparison would better isolate the hierarchical strategy's contribution. The multi-frame-rate approach was introduced to address varying motion speeds and improve alignment under data constraints, with benefits visible in qualitative results and overall metrics. Due to compute limits, this specific ablation was not performed. We will revise the manuscript to elaborate on the design rationale, add qualitative comparisons where possible, and list the missing ablation as a limitation and future direction. revision: partial

  3. Referee: §4 (Experiments): The manuscript supplies no quantitative metrics (e.g., specific FID, CLIP-score, or human preference percentages), named baselines, or dataset statistics to substantiate the 'outperforms all publicly available models at a large margin' statement. These details are necessary to evaluate the central empirical claim.

    Authors: We will revise the experiments section to report the specific quantitative metrics (FID, CLIP-score, human preference percentages), explicitly name all public baselines compared, and include dataset statistics. These details were available from our evaluations but omitted for brevity in the initial submission; adding them will allow direct assessment of the performance claims. revision: yes

standing simulated objections not resolved
  • Full 9B-scale ablation isolating weight inheritance from random initialization
  • Controlled ablation comparing multi-frame-rate hierarchical training to single-rate baseline

Circularity Check

0 steps flagged

No significant circularity; empirical performance claim is independently evaluated

full rationale

The paper's central claim is an empirical statement that CogVideo outperforms public baselines after inheriting weights from CogView2 and applying multi-frame-rate hierarchical training. This is supported by machine and human evaluations on external benchmarks rather than any derivation that reduces to its own inputs by construction. No equations, fitted parameters renamed as predictions, or load-bearing self-citations that forbid alternatives appear in the provided abstract or described methodology. The inheritance from CogView2 and the training strategy are presented as engineering choices whose effectiveness is measured externally, not assumed or defined into the result. This is a standard self-contained empirical ML paper with no circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review limited to abstract; no explicit free parameters, axioms, or invented entities are described beyond standard transformer assumptions.

axioms (1)
  • domain assumption A pretrained text-to-image transformer can be effectively adapted to video by adding temporal training
    Invoked when the paper states it inherits from CogView2 to address computation cost

pith-pipeline@v0.9.0 · 5434 in / 1076 out tokens · 58400 ms · 2026-05-11T12:17:32.445086+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 39 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MusicLM: Generating Music From Text

    cs.SD 2023-01 conditional novelty 8.0

    MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.

  2. GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion

    cs.CV 2026-05 unverdicted novelty 7.0

    GTA generates 3D worlds from single images via a two-stage video diffusion process that prioritizes geometry before appearance to improve structural consistency.

  3. OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    OmniNFT introduces modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting in an online diffusion RL framework to improve audio-video quality, alignment, and synchronization.

  4. DCR: Counterfactual Attractor Guidance for Rare Compositional Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    DCR uses a counterfactual attractor and projection-based repulsion to suppress default completion bias in diffusion models, improving fidelity for rare compositional prompts while preserving quality.

  5. DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation

    cs.CV 2026-04 unverdicted novelty 7.0

    DeVI enables zero-shot physically plausible dexterous control by imitating synthetic videos via a hybrid 3D-human plus 2D-object tracking reward.

  6. ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis

    cs.CV 2026-04 unverdicted novelty 7.0

    ReImagine decouples human appearance from temporal consistency via pretrained image backbones, SMPL-X motion guidance, and training-free video diffusion refinement to generate high-quality controllable videos.

  7. RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 7.0

    RoboWM-Bench evaluates video world models by converting their outputs into executable robot actions and running them on manipulation tasks, showing that physical inconsistencies remain common.

  8. UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models

    cs.CV 2026-04 unverdicted novelty 7.0

    UDM-GRPO is the first RL integration for uniform discrete diffusion models, using final clean samples as actions and forward-process trajectory reconstruction to raise GenEval accuracy from 69% to 96% and OCR accuracy...

  9. MCSC-Bench: Multimodal Context-to-Script Creation for Realistic Video Production

    cs.MM 2026-04 unverdicted novelty 7.0

    MCSC-Bench is the first large-scale dataset for the Multimodal Context-to-Script Creation task, requiring models to select relevant shots from redundant materials, plan missing shots, and generate coherent scripts wit...

  10. LottieGPT: Tokenizing Vector Animation for Autoregressive Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    LottieGPT tokenizes Lottie animations into compact sequences and fine-tunes Qwen-VL to autoregressively generate coherent vector animations from natural language or visual prompts, outperforming prior SVG models.

  11. MoRight: Motion Control Done Right

    cs.CV 2026-04 unverdicted novelty 7.0

    MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply ...

  12. OmniCamera: A Unified Framework for Multi-task Video Generation with Arbitrary Camera Control

    cs.CV 2026-04 unverdicted novelty 7.0

    OmniCamera disentangles video content and camera motion for multi-task generation with arbitrary camera control via the OmniCAM hybrid dataset and Dual-level Curriculum Co-Training.

  13. Detecting AI-Generated Videos with Spiking Neural Networks

    cs.CV 2026-05 unverdicted novelty 6.0

    MAST with spiking neural networks achieves 93.14% mean accuracy detecting AI-generated videos from 10 unseen generators by exploiting smoother pixel residuals and compact semantic trajectories.

  14. Video Active Perception: Effective Inference-Time Long-Form Video Understanding with Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    VAP is a training-free active-perception method that improves zero-shot long-form video QA performance and frame efficiency up to 5.6x in VLMs by selecting keyframes that differ from priors generated by a text-conditi...

  15. UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors

    cs.CV 2026-05 unverdicted novelty 6.0

    UniVidX unifies diverse video generation tasks into one conditional diffusion model using stochastic condition masking, decoupled gated LoRAs, and cross-modal self-attention.

  16. PhyCo: Learning Controllable Physical Priors for Generative Motion

    cs.CV 2026-04 unverdicted novelty 6.0

    PhyCo adds continuous physical control to video diffusion models via physics-supervised fine-tuning on a large simulation dataset and VLM-guided rewards, yielding measurable gains in physical realism on the Physics-IQ...

  17. Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.

  18. Structured State-Space Regularization for Compact and Generation-Friendly Image Tokenization

    cs.CV 2026-04 unverdicted novelty 6.0

    A new regularizer transfers frequency awareness from state-space models into image tokenizers, yielding more compact latents that improve diffusion-model generation quality with little reconstruction penalty.

  19. ELT: Elastic Looped Transformers for Visual Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Elastic Looped Transformers share weights across recurrent blocks and apply intra-loop self-distillation to deliver 4x parameter reduction while matching competitive FID and FVD scores on ImageNet and UCF-101.

  20. InsEdit: Towards Instruction-based Visual Editing via Data-Efficient Video Diffusion Models Adaptation

    cs.CV 2026-04 unverdicted novelty 6.0

    InsEdit adapts a video diffusion backbone for text-instruction video editing via Mutual Context Attention, achieving SOTA open-source results with O(100K) data while also supporting image editing.

  21. SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations

    cs.CV 2026-04 unverdicted novelty 6.0

    SceneScribe-1M is a new dataset of 1 million videos with semantic text, camera parameters, dense depth, and consistent 3D point tracks to support monocular depth estimation, scene reconstruction, point tracking, and t...

  22. GENSERVE: Efficient Co-Serving of Heterogeneous Diffusion Model Workloads

    cs.DC 2026-04 unverdicted novelty 6.0

    GENSERVE improves SLO attainment by up to 44% for co-serving heterogeneous T2I and T2V diffusion workloads via step-level preemption, elastic parallelism, and joint scheduling.

  23. VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

    cs.CV 2025-03 accept novelty 6.0

    VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs...

  24. Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

    cs.CV 2024-12 unverdicted novelty 6.0

    Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.

  25. CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    cs.CV 2024-08 unverdicted novelty 6.0

    CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.

  26. CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    cs.CV 2024-04 unverdicted novelty 6.0

    CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.

  27. Make-A-Video: Text-to-Video Generation without Text-Video Data

    cs.CV 2022-09 unverdicted novelty 6.0

    Make-A-Video achieves state-of-the-art text-to-video generation by decomposing temporal U-Net and attention structures to add space-time modeling to text-to-image models, trained without any paired text-video data.

  28. R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow

    cs.CV 2026-05 unverdicted novelty 5.0

    R-DMesh uses a VAE with a learned rectification jump offset and Triflow Attention inside a rectified-flow diffusion transformer to produce video-aligned 4D meshes despite initial pose misalignment.

  29. ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation

    cs.CV 2026-05 unverdicted novelty 5.0

    ST-Gen4D uses a world model that fuses global appearance and local dynamic graphs into a 4D cognition representation to guide consistent 4D Gaussian generation.

  30. Embody4D: A Generalist 4D World Model for Embodied AI

    cs.CV 2026-05 unverdicted novelty 5.0

    Embody4D generates high-fidelity, view-consistent novel views from monocular videos for embodied scenarios via 3D-aware data synthesis, adaptive noise injection, and interaction-aware attention.

  31. StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement

    cs.RO 2026-04 unverdicted novelty 5.0

    StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict act...

  32. Controllable Video Object Insertion via Multiview Priors

    cs.CV 2026-04 unverdicted novelty 5.0

    A multi-view prior-based framework for video object insertion that uses dual-path conditioning and an integration-aware consistency module to improve appearance stability and occlusion handling.

  33. Ride the Wave: Precision-Allocated Sparse Attention for Smooth Video Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    PASA uses curvature-aware dynamic budgeting, grouped approximations, and stochastic attention routing to accelerate video diffusion transformers while eliminating temporal flickering from sparse patterns.

  34. Not all tokens contribute equally to diffusion learning

    cs.CV 2026-04 unverdicted novelty 5.0

    DARE mitigates neglect of important tokens in conditional diffusion models via distribution-rectified guidance and spatial attention alignment.

  35. Open-Sora: Democratizing Efficient Video Production for All

    cs.CV 2024-12 unverdicted novelty 5.0

    Open-Sora releases an open-source video generation model based on a Spatial-Temporal Diffusion Transformer that decouples spatial and temporal attention, supporting text-to-video, image-to-video, and text-to-image tas...

  36. Movie Gen: A Cast of Media Foundation Models

    cs.CV 2024-10 unverdicted novelty 5.0

    A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.

  37. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

  38. ModelScope Text-to-Video Technical Report

    cs.CV 2023-08 unverdicted novelty 4.0

    ModelScopeT2V is a 1.7-billion-parameter text-to-video model built on Stable Diffusion that adds temporal modeling and outperforms prior methods on three evaluation metrics.

  39. Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

    cs.CV 2024-02 unverdicted novelty 2.0

    The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · cited by 39 Pith papers · 8 internal anchors

  1. [1]

    T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020

  2. [2]

    Carreira and A

    J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages 6299–6308, 2017

  3. [3]

    7, 13, 16

    J. Carreira, E. Noland, A. Banki-Horvath, C. Hillier, and A. Zisserman. A short note about kinetics-600. arXiv preprint arXiv:1808.01340, 2018

  4. [4]

    arXiv:1907.06571 , year=

    A. Clark, J. Donahue, and K. Simonyan. Adversarial video generation on complex datasets. arXiv preprint arXiv:1907.06571, 2019

  5. [5]

    M. Ding, Z. Yang, W. Hong, W. Zheng, C. Zhou, D. Yin, J. Lin, X. Zou, Z. Shao, H. Yang, et al. Cogview: Mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems, 34, 2021

  6. [6]

    M. Ding, W. Zheng, W. Hong, and J. Tang. Cogview2: Faster and better text-to-image generation via hierarchical transformers. arXiv preprint arXiv:2204.14217, 2022

  7. [7]

    Esser, R

    P. Esser, R. Rombach, and B. Ommer. Taming transformers for high-resolution image synthesis. arXiv preprint arXiv:2012.09841, 2020

  8. [8]

    C. Finn, I. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. Advances in neural information processing systems , 29, 2016

  9. [9]

    S. Ge, T. Hayes, H. Yang, X. Yin, G. Pang, D. Jacobs, J.-B. Huang, and D. Parikh. Long video generation with time-agnostic vqgan and time-sensitive transformer. arXiv preprint arXiv:2204.03638, 2022

  10. [10]

    I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y . Bengio. Generative adversarial networks.arXiv preprint arXiv:1406.2661, 2014

  11. [11]

    J. Ho, T. Salimans, A. Gritsenko, W. Chan, M. Norouzi, and D. J. Fleet. Video diffusion models. arXiv preprint arXiv:2204.03458, 2022

  12. [12]

    Karpathy, G

    A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. InProceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014

  13. [13]

    J. Lin, C. Gan, and S. Han. Tsm: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 7083–7093, 2019

  14. [14]

    Z. Liu, Y . Lin, Y . Cao, H. Hu, Y . Wei, Z. Zhang, S. Lin, and B. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021

  15. [15]

    P. Luc, A. Clark, S. Dieleman, D. d. L. Casas, Y . Doron, A. Cassirer, and K. Simonyan. Transformation-based adversarial video prediction on large-scale data. arXiv preprint arXiv:2003.04035, 2020

  16. [16]

    Miech, D

    A. Miech, D. Zhukov, J.-B. Alayrac, M. Tapaswi, I. Laptev, and J. Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 2630–2640, 2019. 10

  17. [17]

    Latent video transformer

    R. Rakhimov, D. V olkhonskiy, A. Artemov, D. Zorin, and E. Burnaev. Latent video transformer. arXiv preprint arXiv:2006.10704, 2020

  18. [18]

    Zero-Shot Text-to-Image Generation

    A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. V oss, A. Radford, M. Chen, and I. Sutskever. Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092, 2021

  19. [19]

    Saito, E

    M. Saito, E. Matsumoto, and S. Saito. Temporal generative adversarial nets with singular value clipping. In Proceedings of the IEEE international conference on computer vision , pages 2830–2839, 2017

  20. [20]

    Saito, S

    M. Saito, S. Saito, M. Koyama, and S. Kobayashi. Train sparsely, generate densely: Memory- efficient unsupervised training of high-resolution temporal gan. International Journal of Computer Vision, 128(10):2586–2606, 2020

  21. [21]

    Salimans, I

    T. Salimans, I. Goodfellow, W. Zaremba, V . Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 2234–2242, 2016

  22. [22]

    UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

    K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012

  23. [23]

    Sutskever, J

    I. Sutskever, J. Martens, and G. Hinton. Generating text with recurrent neural networks. In ICML’11, page 1017–1024, 2011

  24. [24]

    Y . Tian, J. Ren, M. Chai, K. Olszewski, X. Peng, D. N. Metaxas, and S. Tulyakov. A good image generator is what you need for high-resolution video synthesis.arXiv preprint arXiv:2104.15069, 2021

  25. [25]

    D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 4489–4497, 2015

  26. [26]

    Tulyakov, M.-Y

    S. Tulyakov, M.-Y . Liu, X. Yang, and J. Kautz. Mocogan: Decomposing motion and content for video generation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1526–1535, 2018

  27. [27]

    Towards Accurate Generative Models of Video: A New Metric & Challenges

    T. Unterthiner, S. van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly. To- wards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018

  28. [28]

    van den Oord, O

    A. van den Oord, O. Vinyals, and K. Kavukcuoglu. Neural discrete representation learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems , pages 6309–6318, 2017

  29. [29]

    Attention Is All You Need

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017

  30. [30]

    V ondrick, H

    C. V ondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics.Advances in neural information processing systems , 29, 2016

  31. [31]

    X. Wang, J. Wu, J. Chen, L. Li, Y .-F. Wang, and W. Y . Wang. Vatex: A large-scale, high- quality multilingual dataset for video-and-language research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4581–4591, 2019

  32. [32]

    Y . Wang, M. Long, J. Wang, Z. Gao, and P. S. Yu. Predrnn: Recurrent neural networks for predictive learning using spatiotemporal lstms. Advances in neural information processing systems, 30, 2017

  33. [33]

    Scaling Autoregressive Video Models

    D. Weissenborn, O. Täckström, and J. Uszkoreit. Scaling autoregressive video models. arXiv preprint arXiv:1906.02634, 2019

  34. [34]

    C. Wu, L. Huang, Q. Zhang, B. Li, L. Ji, F. Yang, G. Sapiro, and N. Duan. Godiva: Generating open-domain videos from natural descriptions. arXiv preprint arXiv:2104.14806, 2021. 11

  35. [35]

    C. Wu, J. Liang, L. Ji, F. Yang, Y . Fang, D. Jiang, and N. Duan. N\" uwa: Visual synthesis pre-training for neural visual world creation. arXiv preprint arXiv:2111.12417, 2021

  36. [36]

    W. Yan, Y . Zhang, P. Abbeel, and A. Srinivas. Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021

  37. [37]

    S. Yu, J. Tack, S. Mo, H. Kim, J. Kim, J.-W. Ha, and J. Shin. Generating videos with dynamics- aware implicit generative adversarial networks. arXiv preprint arXiv:2202.10571, 2022. A Attention Analysis To explore the attention mechanism of dual-channel attention, we visualize (1) the attention distribu- tion in the temporal channel and (2) the mixture fa...