Pith · machine review for the scientific record

arxiv: 2312.14125 · v4 · submitted 2023-12-21 · 💻 cs.CV · cs.AI

Recognition: no theorem link

VideoPoet: A Large Language Model for Zero-Shot Video Generation


Pith reviewed 2026-05-15 17:46 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords video generation · zero-shot learning · decoder-only transformer · multimodal modeling · autoregressive pretraining · video synthesis · language model

The pith

A decoder-only transformer trained on multimodal data generates high-fidelity videos zero-shot from text, images, and audio.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VideoPoet as a language model that synthesizes video and matching audio from diverse inputs such as text prompts, images, and audio clips. It adapts the standard LLM training recipe of autoregressive pretraining followed by task adaptation, but applies it to a mixture of generative objectives across modalities inside a single decoder-only transformer. This setup is meant to show that the same architecture and training protocol used for text can transfer directly to video synthesis without heavy video-specific modifications. If the approach holds, it would mean video generation can be treated as another language-modeling task rather than requiring entirely separate model families.
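The "video as language modeling" framing above can be sketched concretely: modalities are tokenized into one discrete stream and trained with ordinary next-token prediction. The token names, special markers, and sequence layout below are illustrative assumptions, not VideoPoet's actual tokenizer or vocabulary.

```python
# Sketch: video generation as autoregressive language modeling over one
# discrete token stream. Markers and token names are hypothetical.

BOS, EOT, EOV = "<bos>", "<end_text>", "<end_video>"

def build_sequence(text_tokens, video_tokens):
    """Interleave modalities into a single stream: conditioning text
    first, then the video tokens the model must predict."""
    return [BOS] + text_tokens + [EOT] + video_tokens + [EOV]

def next_token_pairs(sequence):
    """Standard autoregressive training pairs: predict token t from tokens < t."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]

seq = build_sequence(["a", "cat", "jumps"], ["v17", "v03", "v91", "v42"])
pairs = next_token_pairs(seq)
# The same cross-entropy loss covers text and video positions alike,
# which is exactly the "video is just another modality" bet.
print(len(seq))       # 10
print(pairs[4][1])    # first video token, predicted from the text prefix
```

Note that nothing here is video-specific: the transfer claim is that this recipe, plus a good tokenizer, suffices.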

Core claim

VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs including images, videos, text, and audio. The training protocol follows that of Large Language Models, consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks, achieving state-of-the-art zero-shot capabilities especially in high-fidelity motion generation.

What carries the argument

Decoder-only transformer trained autoregressively on a mixture of multimodal generative objectives for pretraining and adaptation.
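The "mixture of multimodal generative objectives" amounts to sampling a task per example and rearranging the same token stream into a (conditioning prefix, prediction target) pair. A minimal sketch follows; the task names and splits are illustrative assumptions, not the paper's exact task specification.

```python
import random

# Sketch of mixed-objective pretraining batches, assuming each task is
# encoded as (conditioning prefix, prediction target) over discrete tokens.

def make_example(task, text, frames):
    if task == "text_to_video":
        return text, frames
    if task == "image_to_video":      # condition on the first frame only
        return frames[:1], frames[1:]
    if task == "continuation":        # condition on the first half of the clip
        half = len(frames) // 2
        return frames[:half], frames[half:]
    raise ValueError(task)

TASKS = ["text_to_video", "image_to_video", "continuation"]

def sample_batch(corpus, size, rng):
    batch = []
    for _ in range(size):
        text, frames = rng.choice(corpus)
        task = rng.choice(TASKS)      # one loss function, many objectives
        batch.append((task, make_example(task, text, frames)))
    return batch

rng = random.Random(0)
corpus = [(["a", "dog"], ["f1", "f2", "f3", "f4"])]
for task, (prefix, target) in sample_batch(corpus, 3, rng):
    print(task, prefix, "->", target)
```

Because every task reduces to next-token prediction on a rearranged sequence, a single pretrained model can later be adapted to any one of them.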

If this is right

  • Zero-shot video generation becomes possible from arbitrary combinations of text, image, and audio conditioning.
  • The same pretrained model can be adapted to multiple downstream video tasks without retraining from scratch.
  • Audio is generated jointly with video frames rather than added in a separate step.
  • High-fidelity motion synthesis emerges as a direct result of the autoregressive multimodal pretraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If successful, the approach could allow reuse of existing large-scale language-model training pipelines for visual generation.
  • Longer or more narrative videos might become feasible by extending the context window of the same transformer.
  • It raises the question of whether explicit temporal modeling layers remain necessary once sufficient multimodal data is available.

Load-bearing premise

The mixture of multimodal autoregressive objectives will transfer to coherent zero-shot video generation without needing substantial video-specific architectural biases.

What would settle it

Side-by-side evaluation on standard video benchmarks: the claim would break if VideoPoet showed visibly lower motion coherence or visual quality than specialized video diffusion models, and would stand if it matched or exceeded them.

the original abstract

We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces VideoPoet, a decoder-only transformer language model for zero-shot video generation. It processes multimodal conditioning signals (text, images, videos, audio) and is trained in two stages: pretraining via a mixture of autoregressive generative objectives across modalities, followed by task-specific adaptation. The central empirical claim is that this yields state-of-the-art zero-shot video synthesis with high-fidelity motion and matching audio.

Significance. If the quantitative results hold, the work would demonstrate that standard LLM scaling and autoregressive pretraining can transfer effectively to high-quality video generation without heavy video-specific inductive biases. This would be a notable unification of multimodal generation under the decoder-only paradigm and could simplify future pipelines, though the absence of detailed baselines, model scale, and dataset descriptions in the provided text limits immediate assessment of impact.

major comments (2)
  1. [Abstract] The assertion of 'state-of-the-art capabilities' and 'high-fidelity motions' is presented without supporting quantitative evidence (e.g., FID, FVD, human preference scores, or direct comparisons to prior models such as Make-A-Video). This is load-bearing for the headline claim and must be addressed with tables or figures in the main text.
  2. [Abstract] The manuscript provides no information on model size (parameters), training data volume or composition, or exact autoregressive objectives used in pretraining. These details are required to evaluate whether the zero-shot transfer result is reproducible or merely consistent with scaling trends.
minor comments (1)
  1. [Abstract] The project page URL is given but no corresponding reference or citation is provided for the supplementary materials or videos.
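The FVD metric the referee asks for is, at bottom, a Fréchet distance between two Gaussians fitted to clip-level features of real and generated videos (in practice the features come from a pretrained I3D network; random features stand in for them here). A numpy-only sketch, not a benchmark-grade implementation:

```python
import numpy as np

# Fréchet distance between Gaussians fitted to feature sets:
# d^2 = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1^{1/2} S2 S1^{1/2})^{1/2})

def frechet_distance(feats_a, feats_b):
    mu1, mu2 = feats_a.mean(0), feats_b.mean(0)
    s1 = np.cov(feats_a, rowvar=False)
    s2 = np.cov(feats_b, rowvar=False)
    # Symmetric PSD route to Tr((S1 S2)^{1/2}): eigendecompose S1, form
    # S1^{1/2} S2 S1^{1/2}, and sum the square roots of its eigenvalues.
    w, v = np.linalg.eigh(s1)
    s1_half = (v * np.sqrt(np.clip(w, 0, None))) @ v.T
    inner = s1_half @ s2 @ s1_half
    tr_sqrt = np.sqrt(np.clip(np.linalg.eigvalsh(inner), 0, None)).sum()
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.trace(s1) + np.trace(s2) - 2 * tr_sqrt)

rng = np.random.default_rng(0)
real = rng.normal(size=(2000, 8))             # stand-in for real-video features
fake_close = rng.normal(size=(2000, 8))       # well-matched generator
fake_far = rng.normal(loc=3.0, size=(2000, 8))  # badly-matched generator
print(frechet_distance(real, fake_close) < frechet_distance(real, fake_far))  # True
```

Lower is better; this is why the referee treats missing FVD tables as disqualifying rather than cosmetic.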

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and have revised the manuscript to strengthen the presentation of results and reproducibility details.

point-by-point responses
  1. Referee: [Abstract] The assertion of 'state-of-the-art capabilities' and 'high-fidelity motions' is presented without supporting quantitative evidence (e.g., FID, FVD, human preference scores, or direct comparisons to prior models such as Make-A-Video). This is load-bearing for the headline claim and must be addressed with tables or figures in the main text.

    Authors: We agree that the abstract claims require clear quantitative support in the main text. The full manuscript contains these evaluations in the experiments section, including FVD metrics, human preference scores, and direct comparisons against Make-A-Video and other baselines. To address the concern, we have added a summary table of key zero-shot results (FVD, human studies) to the main body immediately following the introduction, ensuring the SOTA claims are explicitly grounded. revision: yes

  2. Referee: [Abstract] The manuscript provides no information on model size (parameters), training data volume or composition, or exact autoregressive objectives used in pretraining. These details are required to evaluate whether the zero-shot transfer result is reproducible or merely consistent with scaling trends.

    Authors: We acknowledge the importance of these details for assessing the work. The manuscript describes the decoder-only transformer architecture and the two-stage pretraining plus adaptation protocol with a mixture of autoregressive objectives, but we have expanded the methods section to explicitly report the model scale in parameters, the training data volume and composition (including sources for video, image, text, and audio), and the precise set of generative objectives (e.g., next-token prediction on tokenized multimodal sequences). These additions improve reproducibility without altering the core claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity; the claims rest on empirical training results.

full rationale

The paper presents VideoPoet as an empirical LLM trained autoregressively on multimodal data (text, image, video, audio) using a standard decoder-only transformer. Claims of zero-shot video generation and high-fidelity motion rest on training outcomes and evaluations, not on any derivation chain, fitted parameter renamed as prediction, or self-citation that reduces the result to its inputs by construction. No equations or uniqueness theorems are invoked that loop back to the model's own outputs; the protocol follows established LLM scaling literature without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard transformer assumptions and the effectiveness of LLM pretraining for new modalities, with no explicit free parameters listed in the abstract.

axioms (1)
  • domain assumption Autoregressive modeling of multimodal sequences can capture video dynamics effectively.
    Invoked in the pretraining stage description.
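The axiom presupposes that video can be rendered as a discrete token sequence at all. That step is vector quantization: MAGVIT-style tokenizers do it with a learned encoder and codebook; the random codebook below is purely illustrative.

```python
import numpy as np

# Sketch of turning video into discrete tokens via vector quantization.
# The codebook here is random; real tokenizers learn it jointly with an
# encoder/decoder so the ids can be decoded back into pixels.

def tokenize(frames, codebook):
    """Map each frame patch to the id of its nearest codebook vector."""
    t, n, d = frames.shape                      # (time, patches, feature dim)
    flat = frames.reshape(-1, d)
    d2 = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(1).reshape(t, n)           # (time, patches) of token ids

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))             # 16-entry codebook, dim 4
video = rng.normal(size=(3, 5, 4))              # 3 frames, 5 patches each
tokens = tokenize(video, codebook)
print(tokens.shape)                             # (3, 5) -> a 15-token sequence
```

Once frames are ids in a finite vocabulary, the axiom reduces to the familiar claim that autoregressive models can fit sequences over that vocabulary.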

pith-pipeline@v0.9.0 · 5562 in / 1175 out tokens · 64353 ms · 2026-05-15T17:46:28.066693+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models

    cs.CV 2026-05 conditional novelty 7.0

    LiteLVLM prunes visual tokens for pixel grounding by reversing CLIP visual-text similarity to retain referent region tokens, outperforming prior methods by over 5% with 22% speedup and 2.3x memory reduction without an...

  2. MCSC-Bench: Multimodal Context-to-Script Creation for Realistic Video Production

    cs.MM 2026-04 unverdicted novelty 7.0

    MCSC-Bench is the first large-scale dataset for the Multimodal Context-to-Script Creation task, requiring models to select relevant shots from redundant materials, plan missing shots, and generate coherent scripts wit...

  3. FrameDiT: Diffusion Transformer with Matrix Attention for Efficient Video Generation

    cs.CV 2026-03 unverdicted novelty 7.0

    FrameDiT proposes Matrix Attention for DiTs to achieve SOTA video generation with improved temporal coherence and efficiency comparable to local factorized attention.

  4. Stream-T1: Test-Time Scaling for Streaming Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve tempor...

  5. Animator-Centric Skeleton Generation on Objects with Fine-Grained Details

    cs.GR 2026-04 unverdicted novelty 6.0

    An animator-centric skeleton generation method that uses semantic-aware tokenization and a learnable density interval module to produce controllable, high-quality skeletons on complex 3D meshes.

  6. DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer

    cs.CV 2026-04 unverdicted novelty 6.0

    RTR-DiT distills a bidirectional DiT teacher into an autoregressive few-step model using Self Forcing and Distribution Matching Distillation, plus a reference-preserving KV cache, to enable stable real-time text- and ...

  7. Latent-Compressed Variational Autoencoder for Video Diffusion Models

    cs.CV 2026-04 unverdicted novelty 6.0

    A frequency-based latent compression method for video VAEs yields higher reconstruction quality than channel-reduction baselines at fixed compression ratios.

  8. Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?

    cs.RO 2026-04 unverdicted novelty 6.0

    Veo-3 video predictions enable approximate task-level robot trajectories in zero-shot settings but require hierarchical integration with low-level VLA policies for reliable manipulation performance.

  9. MMaDA: Multimodal Large Diffusion Language Models

    cs.CV 2025-05 unverdicted novelty 6.0

    MMaDA is a unified multimodal diffusion model using mixed chain-of-thought fine-tuning and a new UniGRPO reinforcement learning algorithm that outperforms specialized models in reasoning, understanding, and text-to-im...

  10. MAGI-1: Autoregressive Video Generation at Scale

    cs.CV 2025-05 unverdicted novelty 6.0

    MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.

  11. Emu3: Next-Token Prediction is All You Need

    cs.CV 2024-09 unverdicted novelty 6.0

    Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.

  12. Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation

    cs.RO 2024-09 unverdicted novelty 6.0

    Gen2Act enables generalizable robot manipulation for unseen objects and novel motions by using zero-shot human video generation from web data to condition a policy trained on an order of magnitude less robot interaction data.

  13. CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    cs.CV 2024-04 unverdicted novelty 6.0

    CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.

  14. Movie Gen: A Cast of Media Foundation Models

    cs.CV 2024-10 unverdicted novelty 5.0

    A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.

  15. Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    cs.CV 2024-08 unverdicted novelty 5.0

    Show-o unifies autoregressive and discrete diffusion modeling inside one transformer to support multimodal understanding and generation tasks with competitive benchmark performance.

  16. Show-o2: Improved Native Unified Multimodal Models

    cs.CV 2025-06 unverdicted novelty 4.0

    Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.

  17. VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    cs.CV 2024-06 unverdicted novelty 4.0

    VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.

  18. Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

    cs.CV 2025-03 unverdicted novelty 2.0

    The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.

  19. Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

    cs.CV 2024-02 unverdicted novelty 2.0

    The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · cited by 19 Pith papers · 26 internal anchors

  1. [1]

    MusicLM: Generating Music From Text

    Agostinelli, A., Denk, T. I., Borsos, Z., Engel, J., Verzetti, M., Caillon, A., Huang, Q., Jansen, A., Roberts, A., Tagliasacchi, M., et al. Musiclm: Generating music from text. arXiv preprint arXiv:2301.11325,

  2. [2]

    Alternating gradient descent and mixture-of-experts for integrated multimodal perception

    Akbari, H., Kondratyuk, D., Cui, Y., Hornung, R., Wang, H., and Adam, H. Alternating gradient descent and mixture-of-experts for integrated multimodal perception. arXiv preprint arXiv:2305.06324,

  3. [3]

    PaLM 2 Technical Report

    Anil, R., Dai, A. M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., Shakeri, S., Taropa, E., Bailey, P., Chen, Z., et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403,

  4. [4]

    Lumiere: A space-time diffusion model for video generation

    Bar-Tal, O., Chefer, H., Tov, O., Herrmann, C., Paiss, R., Zada, S., Ephrat, A., Hur, J., Li, Y., Michaeli, T., et al. Lumiere: A space-time diffusion model for video generation. arXiv preprint arXiv:2401.12945,

  5. [5]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023a. Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S. W., Fidler, S., and Kreis, K. Align yo...

  6. [6]

    Language Models are Few-Shot Learners

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. NeurIPS, 33: 1877–1901,

  7. [7]

    A Short Note about Kinetics-600

    Carreira, J., Noland, E., Banki-Horvath, A., Hillier, C., and Zisserman, A. A short note about kinetics-600. arXiv preprint arXiv:1808.01340,

  8. [8]

    Muse: Text-to-Image Generation via Masked Generative Transformers

    Chang, H., Zhang, H., Barber, J., Maschinot, A., Lezama, J., Jiang, L., Yang, M.-H., Murphy, K., Freeman, W. T., Rubinstein, M., et al. Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704,

  9. [9]

    VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

    Chen, H., Xia, M., He, Y., Zhang, Y., Cun, X., Yang, S., Xing, J., Liu, Y., Chen, Q., Wang, X., et al. Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512, 2023a. Chen, W., Wu, J., Xie, P., Wu, H., Li, J., Xia, X., Xiao, X., and Lin, L. Control-a-video: Controllable text-to-video generation with dif...

  10. [10]

    PaLM: Scaling Language Modeling with Pathways

    Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. PaLM: Scaling language modeling with pathways. arXiv:2204.02311,

  11. [11]

    PaLM-E: An Embodied Multimodal Language Model

    Driess, D., Xia, F., Sajjadi, M. S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al. PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378,

  12. [12]

    Ccedit: Creative and controllable video editing via diffusion models

    Feng, R., Weng, W., Wang, Y., Yuan, Y., Bao, J., Luo, C., Chen, Z., and Guo, B. Ccedit: Creative and controllable video editing via diffusion models. arXiv preprint arXiv:2309.16496,

  13. [13]

    Tokenflow: Consistent diffusion features for consistent video editing

    Geyer, M., Bar-Tal, O., Bagon, S., and Dekel, T. Tokenflow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373,

  14. [14]

    AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    Guo, Y., Yang, C., Rao, A., Wang, Y., Qiao, Y., Lin, D., and Dai, B. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725,

  15. [15]

    Maskvit: Masked visual pre-training for video prediction

    Gupta, A., Tian, S., Zhang, Y., Wu, J., Martín-Martín, R., and Fei-Fei, L. Maskvit: Masked visual pre-training for video prediction. arXiv preprint arXiv:2206.11894,

  16. [16]

    Photorealistic video generation with diffusion models

    Gupta, A., Yu, L., Sohn, K., Gu, X., Hahn, M., Fei-Fei, L., Essa, I., Jiang, L., and Lezama, J. Photorealistic video generation with diffusion models. arXiv preprint arXiv:2312.06662,

  17. [17]

    Latent Video Diffusion Models for High-Fidelity Long Video Generation

    He, Y., Yang, T., Zhang, Y., Shan, Y., and Chen, Q. Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:2211.13221, 2(3):4,

  18. [18]

    Classifier-Free Diffusion Guidance

    Ho, J. and Salimans, T. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598,

  19. [19]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D. P., Poole, B., Norouzi, M., Fleet, D. J., et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022a. Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., and Fleet, D. J. Video diffusion models. arXiv:2204.03458, 20...

  20. [20]

    CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

    Hong, W., Ding, M., Zheng, W., Liu, X., and Tang, J. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868,

  21. [21]

    GAIA-1: A Generative World Model for Autonomous Driving

    Hu, A., Russell, L., Yeo, H., Murez, Z., Fedoseev, G., Kendall, A., Shotton, J., and Corrado, G. Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080 ,

  22. [22]

    StarCoder: may the source be with you!

    Li, R., Allal, L. B., Zi, Y., Muennighoff, N., Kocetkov, D., Mou, C., Marone, M., Akiki, C., Li, J., Chim, J., et al. StarCoder: may the source be with you! arXiv:2305.06161,

  23. [23]

    MagicEdit: High-Fidelity and Temporally Coherent Video Editing

    Liew, J. H., Yan, H., Zhang, J., Xu, Z., and Feng, J. Magicedit: High-fidelity and temporally coherent video editing. arXiv preprint arXiv:2308.14749,

  24. [24]

    SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

    Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.-Y., and Ermon, S. Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073,

  25. [25]

    Transframer: Arbitrary frame prediction with generative models

    Nash, C., Carreira, J., Walker, J., Barr, I., Jaegle, A., Malinowski, M., and Battaglia, P. Transframer: Arbitrary frame prediction with generative models. arXiv preprint arXiv:2203.09494,

  26. [26]

    GPT-4 Technical Report

    OpenAI. GPT-4 technical report. arXiv:2303.08774,

  27. [27]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    URL https://pika.art/launch. Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., and Rombach, R. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952,

  28. [28]

    Zero-Shot Text-to-Image Generation

    Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., and Sutskever, I. Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092,

  29. [29]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3,

  30. [30]

    AudioPaLM: A Large Language Model That Can Speak and Listen

    Rubenstein, P. K., Asawaroengchai, C., Nguyen, D. D., Bapna, A., Borsos, Z., Quitry, F. d. C., Chen, P., Badawy, D. E., Han, W., Kharitonov, E., et al. Audiopalm: A large language model that can speak and listen. arXiv preprint arXiv:2306.12925,

  31. [31]

    A step toward more inclusive people annotations for fairness

    Schumann, C., Ricco, S., Prabhu, U., Ferrari, V., and Pantofaru, C. A step toward more inclusive people annotations for fairness. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pp. 916–925,

  32. [32]

    Make-A-Video: Text-to-Video Generation without Text-Video Data

    URL https://openreview.net/forum?id=L9I9FhHfS3. Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792,

  33. [33]

    UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

    Soomro, K., Zamir, A. R., and Shah, M. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402,

  34. [34]

    Any-to-any generation via composable diffusion

    Tang, Z., Yang, Z., Zhu, C., Zeng, M., and Bansal, M. Any-to-any generation via composable diffusion. arXiv preprint arXiv:2305.11846,

  35. [35]

    Towards Accurate Generative Models of Video: A New Metric & Challenges

    Unterthiner, T., Van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., and Gelly, S. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717,

  36. [36]

    Phenaki: Variable Length Video Generation from Open Domain Textual Description

    Villegas, R., Babaeizadeh, M., Kindermans, P.-J., Moraldo, H., Zhang, H., Saffar, M. T., Castro, S., Kunze, J., and Erhan, D. Phenaki: Variable length video generation from open domain textual description. arXiv preprint arXiv:2210.02399,

  37. [37]

    ModelScope Text-to-Video Technical Report

    Wang, J., Yuan, H., Chen, D., Zhang, Y., Wang, X., and Zhang, S. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023a. Wang, W., Xie, K., Liu, Z., Chen, H., Cao, Y., Wang, X., and Shen, C. Zero-shot video editing using off-the-shelf image diffusion models. arXiv preprint arXiv:2303.17599, 2023b. Wang, W., Yang, H., Tuo, Z., ...

  38. [38]

    VideoGPT: Video Generation using VQ-VAE and Transformers

    Yan, W., Zhang, Y., Abbeel, P., and Srinivas, A. Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157,

  39. [39]

    Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

    Yu, J., Xu, Y., Koh, J. Y., Luong, T., Baid, G., Wang, Z., Vasudevan, V., Ku, A., Yang, Y., Ayan, B. K., et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789,

  40. [40]

    Make pixels dance: High-dynamic video generation

    Zeng, Y., Wei, G., Zheng, J., Zou, J., Wei, Y., Zhang, Y., and Li, H. Make pixels dance: High-dynamic video generation. arXiv preprint arXiv:2311.10982,

  41. [41]

    Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation

    Zhang, D. J., Wu, J. Z., Liu, J.-W., Zhao, R., Ran, L., Gu, Y., Gao, D., and Shou, M. Z. Show-1: Marrying pixel and latent diffusion models for text-to-video generation. arXiv preprint arXiv:2309.15818, 2023a. Zhang, L., Rao, A., and Agrawala, M. Adding conditional control to text-to-image diffusion models. In CVPR, pp. 3836–3847, 2023b. Zhang, Y., Jian...

  42. [42]

    MagicVideo: Efficient Video Generation With Latent Diffusion Models

    Zhou, D., Wang, W., Yan, H., Lv, W., Zhu, Y., and Feng, J. Magicvideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018,
