Pith · machine review for the scientific record

arxiv: 2312.14125 · v4 · submitted 2023-12-21 · 💻 cs.CV · cs.AI

Recognition: no theorem link

VideoPoet: A Large Language Model for Zero-Shot Video Generation


Pith reviewed 2026-05-15 17:46 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords video generation · zero-shot learning · decoder-only transformer · multimodal modeling · autoregressive pretraining · video synthesis · language model

The pith

A decoder-only transformer trained on multimodal data generates high-fidelity videos zero-shot from text, images, and audio.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VideoPoet as a language model that synthesizes video and matching audio from diverse inputs such as text prompts, images, and audio clips. It adapts the standard LLM training recipe of autoregressive pretraining followed by task adaptation, but applies it to a mixture of generative objectives across modalities inside a single decoder-only transformer. This setup is meant to show that the same architecture and training protocol used for text can transfer directly to video synthesis without heavy video-specific modifications. If the approach holds, it would mean video generation can be treated as another language-modeling task rather than requiring entirely separate model families.
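The "video as language modeling" framing above can be sketched concretely: modalities are tokenized into one discrete stream and trained with ordinary next-token prediction. The token names, special markers, and sequence layout below are illustrative assumptions, not VideoPoet's actual tokenizer or vocabulary.

```python
# Sketch: video generation as autoregressive language modeling over one
# discrete token stream. Markers and token names are hypothetical.

BOS, EOT, EOV = "<bos>", "<end_text>", "<end_video>"

def build_sequence(text_tokens, video_tokens):
    """Interleave modalities into a single stream: conditioning text
    first, then the video tokens the model must predict."""
    return [BOS] + text_tokens + [EOT] + video_tokens + [EOV]

def next_token_pairs(sequence):
    """Standard autoregressive training pairs: predict token t from tokens < t."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]

seq = build_sequence(["a", "cat", "jumps"], ["v17", "v03", "v91", "v42"])
pairs = next_token_pairs(seq)
# The same cross-entropy loss covers text and video positions alike,
# which is exactly the "video is just another modality" bet.
print(len(seq))       # 10
print(pairs[4][1])    # first video token, predicted from the text prefix
```

Note that nothing here is video-specific: the transfer claim is that this recipe, plus a good tokenizer, suffices.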

Core claim

VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs including images, videos, text, and audio. The training protocol follows that of Large Language Models, consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks, achieving state-of-the-art zero-shot capabilities especially in high-fidelity motion generation.

What carries the argument

Decoder-only transformer trained autoregressively on a mixture of multimodal generative objectives for pretraining and adaptation.
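The "mixture of multimodal generative objectives" amounts to sampling a task per example and rearranging the same token stream into a (conditioning prefix, prediction target) pair. A minimal sketch follows; the task names and splits are illustrative assumptions, not the paper's exact task specification.

```python
import random

# Sketch of mixed-objective pretraining batches, assuming each task is
# encoded as (conditioning prefix, prediction target) over discrete tokens.

def make_example(task, text, frames):
    if task == "text_to_video":
        return text, frames
    if task == "image_to_video":      # condition on the first frame only
        return frames[:1], frames[1:]
    if task == "continuation":        # condition on the first half of the clip
        half = len(frames) // 2
        return frames[:half], frames[half:]
    raise ValueError(task)

TASKS = ["text_to_video", "image_to_video", "continuation"]

def sample_batch(corpus, size, rng):
    batch = []
    for _ in range(size):
        text, frames = rng.choice(corpus)
        task = rng.choice(TASKS)      # one loss function, many objectives
        batch.append((task, make_example(task, text, frames)))
    return batch

rng = random.Random(0)
corpus = [(["a", "dog"], ["f1", "f2", "f3", "f4"])]
for task, (prefix, target) in sample_batch(corpus, 3, rng):
    print(task, prefix, "->", target)
```

Because every task reduces to next-token prediction on a rearranged sequence, a single pretrained model can later be adapted to any one of them.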

If this is right

  • Zero-shot video generation becomes possible from arbitrary combinations of text, image, and audio conditioning.
  • The same pretrained model can be adapted to multiple downstream video tasks without retraining from scratch.
  • Audio is generated jointly with video frames rather than added in a separate step.
  • High-fidelity motion synthesis emerges as a direct result of the autoregressive multimodal pretraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If successful, the approach could allow reuse of existing large-scale language-model training pipelines for visual generation.
  • Longer or more narrative videos might become feasible by extending the context window of the same transformer.
  • It raises the question of whether explicit temporal modeling layers remain necessary once sufficient multimodal data is available.

Load-bearing premise

The mixture of multimodal autoregressive objectives will transfer to coherent zero-shot video generation without needing substantial video-specific architectural biases.

What would settle it

Side-by-side evaluation on standard video benchmarks: the claim would break if VideoPoet showed visibly lower motion coherence or visual quality than specialized video diffusion models, and would stand if it matched or exceeded them.

the original abstract

We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces VideoPoet, a decoder-only transformer language model for zero-shot video generation. It processes multimodal conditioning signals (text, images, videos, audio) and is trained in two stages: pretraining via a mixture of autoregressive generative objectives across modalities, followed by task-specific adaptation. The central empirical claim is that this yields state-of-the-art zero-shot video synthesis with high-fidelity motion and matching audio.

Significance. If the quantitative results hold, the work would demonstrate that standard LLM scaling and autoregressive pretraining can transfer effectively to high-quality video generation without heavy video-specific inductive biases. This would be a notable unification of multimodal generation under the decoder-only paradigm and could simplify future pipelines, though the absence of detailed baselines, model scale, and dataset descriptions in the provided text limits immediate assessment of impact.

major comments (2)
  1. [Abstract] The assertion of 'state-of-the-art capabilities' and 'high-fidelity motions' is presented without supporting quantitative evidence (e.g., FID, FVD, human preference scores, or direct comparisons to prior models such as Make-A-Video). This is load-bearing for the headline claim and must be addressed with tables or figures in the main text.
  2. [Abstract] The manuscript provides no information on model size (parameters), training data volume or composition, or exact autoregressive objectives used in pretraining. These details are required to evaluate whether the zero-shot transfer result is reproducible or merely consistent with scaling trends.
minor comments (1)
  1. [Abstract] The project page URL is given but no corresponding reference or citation is provided for the supplementary materials or videos.
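The FVD metric the referee asks for is, at bottom, a Fréchet distance between two Gaussians fitted to clip-level features of real and generated videos (in practice the features come from a pretrained I3D network; random features stand in for them here). A numpy-only sketch, not a benchmark-grade implementation:

```python
import numpy as np

# Fréchet distance between Gaussians fitted to feature sets:
# d^2 = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1^{1/2} S2 S1^{1/2})^{1/2})

def frechet_distance(feats_a, feats_b):
    mu1, mu2 = feats_a.mean(0), feats_b.mean(0)
    s1 = np.cov(feats_a, rowvar=False)
    s2 = np.cov(feats_b, rowvar=False)
    # Symmetric PSD route to Tr((S1 S2)^{1/2}): eigendecompose S1, form
    # S1^{1/2} S2 S1^{1/2}, and sum the square roots of its eigenvalues.
    w, v = np.linalg.eigh(s1)
    s1_half = (v * np.sqrt(np.clip(w, 0, None))) @ v.T
    inner = s1_half @ s2 @ s1_half
    tr_sqrt = np.sqrt(np.clip(np.linalg.eigvalsh(inner), 0, None)).sum()
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.trace(s1) + np.trace(s2) - 2 * tr_sqrt)

rng = np.random.default_rng(0)
real = rng.normal(size=(2000, 8))             # stand-in for real-video features
fake_close = rng.normal(size=(2000, 8))       # well-matched generator
fake_far = rng.normal(loc=3.0, size=(2000, 8))  # badly-matched generator
print(frechet_distance(real, fake_close) < frechet_distance(real, fake_far))  # True
```

Lower is better; this is why the referee treats missing FVD tables as disqualifying rather than cosmetic.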

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and have revised the manuscript to strengthen the presentation of results and reproducibility details.

point-by-point responses
  1. Referee: [Abstract] The assertion of 'state-of-the-art capabilities' and 'high-fidelity motions' is presented without supporting quantitative evidence (e.g., FID, FVD, human preference scores, or direct comparisons to prior models such as Make-A-Video). This is load-bearing for the headline claim and must be addressed with tables or figures in the main text.

    Authors: We agree that the abstract claims require clear quantitative support in the main text. The full manuscript contains these evaluations in the experiments section, including FVD metrics, human preference scores, and direct comparisons against Make-A-Video and other baselines. To address the concern, we have added a summary table of key zero-shot results (FVD, human studies) to the main body immediately following the introduction, ensuring the SOTA claims are explicitly grounded. revision: yes

  2. Referee: [Abstract] The manuscript provides no information on model size (parameters), training data volume or composition, or exact autoregressive objectives used in pretraining. These details are required to evaluate whether the zero-shot transfer result is reproducible or merely consistent with scaling trends.

    Authors: We acknowledge the importance of these details for assessing the work. The manuscript describes the decoder-only transformer architecture and the two-stage pretraining plus adaptation protocol with a mixture of autoregressive objectives, but we have expanded the methods section to explicitly report the model scale in parameters, the training data volume and composition (including sources for video, image, text, and audio), and the precise set of generative objectives (e.g., next-token prediction on tokenized multimodal sequences). These additions improve reproducibility without altering the core claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity; the claims rest on empirical training results.

full rationale

The paper presents VideoPoet as an empirical LLM trained autoregressively on multimodal data (text, image, video, audio) using a standard decoder-only transformer. Claims of zero-shot video generation and high-fidelity motion rest on training outcomes and evaluations, not on any derivation chain, fitted parameter renamed as prediction, or self-citation that reduces the result to its inputs by construction. No equations or uniqueness theorems are invoked that loop back to the model's own outputs; the protocol follows established LLM scaling literature without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard transformer assumptions and the effectiveness of LLM pretraining for new modalities, with no explicit free parameters listed in the abstract.

axioms (1)
  • domain assumption Autoregressive modeling of multimodal sequences can capture video dynamics effectively.
    Invoked in the pretraining stage description.
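The axiom presupposes that video can be rendered as a discrete token sequence at all. That step is vector quantization: MAGVIT-style tokenizers do it with a learned encoder and codebook; the random codebook below is purely illustrative.

```python
import numpy as np

# Sketch of turning video into discrete tokens via vector quantization.
# The codebook here is random; real tokenizers learn it jointly with an
# encoder/decoder so the ids can be decoded back into pixels.

def tokenize(frames, codebook):
    """Map each frame patch to the id of its nearest codebook vector."""
    t, n, d = frames.shape                      # (time, patches, feature dim)
    flat = frames.reshape(-1, d)
    d2 = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(1).reshape(t, n)           # (time, patches) of token ids

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))             # 16-entry codebook, dim 4
video = rng.normal(size=(3, 5, 4))              # 3 frames, 5 patches each
tokens = tokenize(video, codebook)
print(tokens.shape)                             # (3, 5) -> a 15-token sequence
```

Once frames are ids in a finite vocabulary, the axiom reduces to the familiar claim that autoregressive models can fit sequences over that vocabulary.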

pith-pipeline@v0.9.0 · 5562 in / 1175 out tokens · 64353 ms · 2026-05-15T17:46:28.066693+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models

    cs.CV 2026-05 conditional novelty 7.0

    LiteLVLM prunes visual tokens for pixel grounding by reversing CLIP visual-text similarity to retain referent region tokens, outperforming prior methods by over 5% with 22% speedup and 2.3x memory reduction without an...

  2. MCSC-Bench: Multimodal Context-to-Script Creation for Realistic Video Production

    cs.MM 2026-04 unverdicted novelty 7.0

    MCSC-Bench is the first large-scale dataset for the Multimodal Context-to-Script Creation task, requiring models to select relevant shots from redundant materials, plan missing shots, and generate coherent scripts wit...

  3. FrameDiT: Diffusion Transformer with Matrix Attention for Efficient Video Generation

    cs.CV 2026-03 unverdicted novelty 7.0

    FrameDiT proposes Matrix Attention for DiTs to achieve SOTA video generation with improved temporal coherence and efficiency comparable to local factorized attention.

  4. Stream-T1: Test-Time Scaling for Streaming Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve tempor...

  5. Animator-Centric Skeleton Generation on Objects with Fine-Grained Details

    cs.GR 2026-04 unverdicted novelty 6.0

    An animator-centric skeleton generation method that uses semantic-aware tokenization and a learnable density interval module to produce controllable, high-quality skeletons on complex 3D meshes.

  6. DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer

    cs.CV 2026-04 unverdicted novelty 6.0

    RTR-DiT distills a bidirectional DiT teacher into an autoregressive few-step model using Self Forcing and Distribution Matching Distillation, plus a reference-preserving KV cache, to enable stable real-time text- and ...

  7. Latent-Compressed Variational Autoencoder for Video Diffusion Models

    cs.CV 2026-04 unverdicted novelty 6.0

    A frequency-based latent compression method for video VAEs yields higher reconstruction quality than channel-reduction baselines at fixed compression ratios.

  8. Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?

    cs.RO 2026-04 unverdicted novelty 6.0

    Veo-3 video predictions enable approximate task-level robot trajectories in zero-shot settings but require hierarchical integration with low-level VLA policies for reliable manipulation performance.

  9. MMaDA: Multimodal Large Diffusion Language Models

    cs.CV 2025-05 unverdicted novelty 6.0

    MMaDA is a unified multimodal diffusion model using mixed chain-of-thought fine-tuning and a new UniGRPO reinforcement learning algorithm that outperforms specialized models in reasoning, understanding, and text-to-im...

  10. MAGI-1: Autoregressive Video Generation at Scale

    cs.CV 2025-05 unverdicted novelty 6.0

    MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.

  11. Emu3: Next-Token Prediction is All You Need

    cs.CV 2024-09 unverdicted novelty 6.0

    Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.

  12. Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation

    cs.RO 2024-09 unverdicted novelty 6.0

    Gen2Act enables generalizable robot manipulation for unseen objects and novel motions by using zero-shot human video generation from web data to condition a policy trained on an order of magnitude less robot interaction data.

  13. CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    cs.CV 2024-04 unverdicted novelty 6.0

    CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.

  14. Movie Gen: A Cast of Media Foundation Models

    cs.CV 2024-10 unverdicted novelty 5.0

    A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.

  15. Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    cs.CV 2024-08 unverdicted novelty 5.0

    Show-o unifies autoregressive and discrete diffusion modeling inside one transformer to support multimodal understanding and generation tasks with competitive benchmark performance.

  16. Show-o2: Improved Native Unified Multimodal Models

    cs.CV 2025-06 unverdicted novelty 4.0

    Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.

  17. VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    cs.CV 2024-06 unverdicted novelty 4.0

    VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.

  18. Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

    cs.CV 2025-03 unverdicted novelty 2.0

    The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.

  19. Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

    cs.CV 2024-02 unverdicted novelty 2.0

    The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · cited by 19 Pith papers · 26 internal anchors

  1. [1]

    MusicLM: Generating Music From Text

    Agostinelli, A., Denk, T. I., Borsos, Z., Engel, J., Verzetti, M., Caillon, A., Huang, Q., Jansen, A., Roberts, A., Tagliasacchi, M., et al. Musiclm: Generating music from text. arXiv preprint arXiv:2301.11325,

  2. [2]

    Alternating gradient descent and mixture-of-experts for integrated multimodal perception

    Akbari, H., Kondratyuk, D., Cui, Y., Hornung, R., Wang, H., and Adam, H. Alternating gradient descent and mixture-of-experts for integrated multimodal perception. arXiv preprint arXiv:2305.06324,

  3. [3]

    PaLM 2 Technical Report

    Anil, R., Dai, A. M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., Shakeri, S., Taropa, E., Bailey, P., Chen, Z., et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403,

  4. [4]

    Lumiere: A space-time diffusion model for video generation

    Bar-Tal, O., Chefer, H., Tov, O., Herrmann, C., Paiss, R., Zada, S., Ephrat, A., Hur, J., Li, Y., Michaeli, T., et al. Lumiere: A space-time diffusion model for video generation. arXiv preprint arXiv:2401.12945,

  5. [5]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023a. Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S. W., Fidler, S., and Kreis, K. Align yo...

  6. [6]

    Language Models are Few-Shot Learners

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. NeurIPS, 33: 1877–1901,

  7. [7]

    A Short Note about Kinetics-600

    Carreira, J., Noland, E., Banki-Horvath, A., Hillier, C., and Zisserman, A. A short note about kinetics-600. arXiv preprint arXiv:1808.01340,

  8. [8]

    Muse: Text-to-Image Generation via Masked Generative Transformers

    Chang, H., Zhang, H., Barber, J., Maschinot, A., Lezama, J., Jiang, L., Yang, M.-H., Murphy, K., Freeman, W. T., Rubinstein, M., et al. Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704,

  9. [9]

    VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

    Chen, H., Xia, M., He, Y., Zhang, Y., Cun, X., Yang, S., Xing, J., Liu, Y., Chen, Q., Wang, X., et al. Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512, 2023a. Chen, W., Wu, J., Xie, P., Wu, H., Li, J., Xia, X., Xiao, X., and Lin, L. Control-a-video: Controllable text-to-video generation with dif...

  10. [10]

    PaLM: Scaling Language Modeling with Pathways

    Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. PaLM: Scaling language modeling with pathways. arXiv:2204.02311,

  11. [11]

    PaLM-E: An Embodied Multimodal Language Model

    Driess, D., Xia, F., Sajjadi, M. S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al. PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378,

  12. [12]

    Ccedit: Creative and controllable video editing via diffusion models

    Feng, R., Weng, W., Wang, Y., Yuan, Y., Bao, J., Luo, C., Chen, Z., and Guo, B. Ccedit: Creative and controllable video editing via diffusion models. arXiv preprint arXiv:2309.16496,

  13. [13]

    Tokenflow: Consistent diffusion features for consistent video editing

    Geyer, M., Bar-Tal, O., Bagon, S., and Dekel, T. Tokenflow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373,

  14. [14]

    AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    Guo, Y., Yang, C., Rao, A., Wang, Y., Qiao, Y., Lin, D., and Dai, B. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725,

  15. [15]

    Maskvit: Masked visual pre-training for video prediction

    Gupta, A., Tian, S., Zhang, Y., Wu, J., Martín-Martín, R., and Fei-Fei, L. Maskvit: Masked visual pre-training for video prediction. arXiv preprint arXiv:2206.11894,

  16. [16]

    Photorealistic video generation with diffusion models

    Gupta, A., Yu, L., Sohn, K., Gu, X., Hahn, M., Fei-Fei, L., Essa, I., Jiang, L., and Lezama, J. Photorealistic video generation with diffusion models. arXiv preprint arXiv:2312.06662,

  17. [17]

    Latent Video Diffusion Models for High-Fidelity Long Video Generation

    He, Y., Yang, T., Zhang, Y., Shan, Y., and Chen, Q. Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:2211.13221, 2(3):4,

  18. [18]

    Classifier-Free Diffusion Guidance

    Ho, J. and Salimans, T. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598,

  19. [19]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D. P., Poole, B., Norouzi, M., Fleet, D. J., et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022a. Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., and Fleet, D. J. Video diffusion models. arXiv:2204.03458, 20...

  20. [20]

    CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

    Hong, W., Ding, M., Zheng, W., Liu, X., and Tang, J. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868,

  21. [21]

    GAIA-1: A Generative World Model for Autonomous Driving

    Hu, A., Russell, L., Yeo, H., Murez, Z., Fedoseev, G., Kendall, A., Shotton, J., and Corrado, G. Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080 ,

  22. [22]

    StarCoder: may the source be with you!

    Li, R., Allal, L. B., Zi, Y., Muennighoff, N., Kocetkov, D., Mou, C., Marone, M., Akiki, C., Li, J., Chim, J., et al. StarCoder: may the source be with you! arXiv:2305.06161,

  23. [23]

    MagicEdit: High-Fidelity and Temporally Coherent Video Editing

    Liew, J. H., Yan, H., Zhang, J., Xu, Z., and Feng, J. Magicedit: High-fidelity and temporally coherent video editing. arXiv preprint arXiv:2308.14749,

  24. [24]

    SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

    Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.-Y., and Ermon, S. Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073,

  25. [25]

    Transframer: Arbitrary frame prediction with generative models

    Nash, C., Carreira, J., Walker, J., Barr, I., Jaegle, A., Malinowski, M., and Battaglia, P. Transframer: Arbitrary frame prediction with generative models. arXiv preprint arXiv:2203.09494,

  26. [26]

    GPT-4 Technical Report

    OpenAI. GPT-4 technical report. arXiv:2303.08774,

  27. [27]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    URL https://pika.art/launch. Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., and Rombach, R. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952,

  28. [28]

    Zero-Shot Text-to-Image Generation

    Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., and Sutskever, I. Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092,

  29. [29]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3,

  30. [30]

    AudioPaLM: A Large Language Model That Can Speak and Listen

    Rubenstein, P. K., Asawaroengchai, C., Nguyen, D. D., Bapna, A., Borsos, Z., Quitry, F. d. C., Chen, P., Badawy, D. E., Han, W., Kharitonov, E., et al. Audiopalm: A large language model that can speak and listen. arXiv preprint arXiv:2306.12925,

  31. [31]

    A step toward more inclusive people annotations for fairness

    Schumann, C., Ricco, S., Prabhu, U., Ferrari, V., and Pantofaru, C. A step toward more inclusive people annotations for fairness. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pp. 916–925,

  32. [32]

    Make-A-Video: Text-to-Video Generation without Text-Video Data

    URL https://openreview.net/forum?id=L9I9FhHfS3. Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792,

  33. [33]

    UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

    Soomro, K., Zamir, A. R., and Shah, M. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402,

  34. [34]

    Any-to-any generation via composable diffusion

    Tang, Z., Yang, Z., Zhu, C., Zeng, M., and Bansal, M. Any-to-any generation via composable diffusion. arXiv preprint arXiv:2305.11846,

  35. [35]

    Towards Accurate Generative Models of Video: A New Metric & Challenges

    Unterthiner, T., Van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., and Gelly, S. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717,

  36. [36]

    Phenaki: Variable Length Video Generation from Open Domain Textual Description

    Villegas, R., Babaeizadeh, M., Kindermans, P.-J., Moraldo, H., Zhang, H., Saffar, M. T., Castro, S., Kunze, J., and Erhan, D. Phenaki: Variable length video generation from open domain textual description. arXiv preprint arXiv:2210.02399,

  37. [37]

    ModelScope Text-to-Video Technical Report

    Wang, J., Yuan, H., Chen, D., Zhang, Y., Wang, X., and Zhang, S. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023a. Wang, W., Xie, K., Liu, Z., Chen, H., Cao, Y., Wang, X., and Shen, C. Zero-shot video editing using off-the-shelf image diffusion models. arXiv preprint arXiv:2303.17599, 2023b. Wang, W., Yang, H., Tuo, Z., ...

  38. [38]

    VideoGPT: Video Generation using VQ-VAE and Transformers

    Yan, W., Zhang, Y., Abbeel, P., and Srinivas, A. Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157,

  39. [39]

    Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

    Yu, J., Xu, Y., Koh, J. Y., Luong, T., Baid, G., Wang, Z., Vasudevan, V., Ku, A., Yang, Y., Ayan, B. K., et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789,

  40. [40]

    Make pixels dance: High-dynamic video generation

    Zeng, Y., Wei, G., Zheng, J., Zou, J., Wei, Y., Zhang, Y., and Li, H. Make pixels dance: High-dynamic video generation. arXiv preprint arXiv:2311.10982,

  41. [41]

    Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation

    Zhang, D. J., Wu, J. Z., Liu, J.-W., Zhao, R., Ran, L., Gu, Y., Gao, D., and Shou, M. Z. Show-1: Marrying pixel and latent diffusion models for text-to-video generation. arXiv preprint arXiv:2309.15818, 2023a. Zhang, L., Rao, A., and Agrawala, M. Adding conditional control to text-to-image diffusion models. In CVPR, pp. 3836–3847, 2023b. Zhang, Y., Jian...

  42. [42]

    MagicVideo: Efficient Video Generation With Latent Diffusion Models

    Zhou, D., Wang, W., Yan, H., Lv, W., Zhu, Y., and Feng, J. Magicvideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018,
