Recognition: no theorem link
VideoPoet: A Large Language Model for Zero-Shot Video Generation
Pith reviewed 2026-05-15 17:46 UTC · model grok-4.3
The pith
A decoder-only transformer trained on multimodal data generates high-fidelity videos zero-shot from text, images, and audio.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs, including images, videos, text, and audio. The training protocol follows that of Large Language Models and consists of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM then serves as a foundation that can be adapted to a range of video generation tasks, achieving state-of-the-art zero-shot performance, particularly in high-fidelity motion generation.
What carries the argument
Decoder-only transformer trained autoregressively on a mixture of multimodal generative objectives for pretraining and adaptation.
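A minimal sketch of that load-bearing mechanism, under the assumption (an illustration, not the paper's implementation) that text, video, and audio are first mapped by separate tokenizers into disjoint ranges of one shared discrete vocabulary; the model name and hyperparameters below are hypothetical stand-ins.

```python
# Minimal sketch (illustration only): next-token prediction over a shared
# discrete vocabulary that pools text, video, and audio tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMultimodalLM(nn.Module):
    def __init__(self, vocab_size=16384, d_model=256, n_heads=4, n_layers=4, max_len=1024):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, ids):
        # ids: (batch, seq) of discrete tokens; the text/video/audio tokenizers
        # are assumed to have mapped their outputs into disjoint id ranges.
        b, t = ids.shape
        x = self.tok(ids) + self.pos(torch.arange(t, device=ids.device))
        causal = nn.Transformer.generate_square_subsequent_mask(t).to(ids.device)
        return self.head(self.blocks(x, mask=causal))

def next_token_loss(model, ids):
    # Standard autoregressive objective: predict token i+1 from tokens <= i.
    logits = model(ids[:, :-1])
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           ids[:, 1:].reshape(-1))

# Toy usage: a text-to-video training example is just one packed token sequence
# (text prefix followed by video tokens); random ids stand in for both here.
model = TinyMultimodalLM()
batch = torch.randint(0, 16384, (2, 128))
next_token_loss(model, batch).backward()
```

Under this framing, the task mixture is largely a data-formatting choice: text-to-video, frame continuation, or audio-conditioned generation differ only in which tokens appear as prefix context in the packed sequence.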
If this is right
- Zero-shot video generation becomes possible from arbitrary combinations of text, image, and audio conditioning.
- The same pretrained model can be adapted to multiple downstream video tasks without retraining from scratch.
- Audio is generated jointly with video frames rather than added in a separate step.
- High-fidelity motion synthesis emerges as a direct result of the autoregressive multimodal pretraining.
Where Pith is reading between the lines
- If successful, the approach could allow reuse of existing large-scale language-model training pipelines for visual generation.
- Longer or more narrative videos might become feasible by extending the context window of the same transformer.
- It raises the question of whether explicit temporal modeling layers remain necessary once sufficient multimodal data is available.
Load-bearing premise
The mixture of multimodal autoregressive objectives will transfer to coherent zero-shot video generation without needing substantial video-specific architectural biases.
What would settle it
Side-by-side evaluation on standard video benchmarks: the central claim would fall if VideoPoet produced visibly lower motion coherence or visual quality than specialized video diffusion models.
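The standard quantitative complement to such side-by-sides is FVD, which the referee report below asks for. As a hedged illustration of what that metric computes, here is a minimal sketch of the Fréchet distance between two sets of precomputed video embeddings; the embedding network (e.g. I3D) and array shapes are assumptions, not the paper's evaluation code.

```python
# Sketch only: Frechet distance between real and generated video embeddings,
# the quantity reported as FVD when the embeddings come from an I3D network.
import numpy as np
from scipy import linalg

def frechet_distance(real_feats: np.ndarray, fake_feats: np.ndarray) -> float:
    # Each row is one video's embedding vector (assumed precomputed elsewhere).
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f)   # matrix square root of covariance product
    if np.iscomplexobj(covmean):            # discard tiny numerical imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))

# Toy usage with random stand-in embeddings; matching distributions give a
# value near zero, diverging distributions push it up.
rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 64))
fake = rng.normal(loc=0.5, size=(1000, 64))
print(frechet_distance(real, fake))
```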
Original abstract
We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VideoPoet, a decoder-only transformer language model for zero-shot video generation. It processes multimodal conditioning signals (text, images, videos, audio) and is trained in two stages: pretraining via a mixture of autoregressive generative objectives across modalities, followed by task-specific adaptation. The central empirical claim is that this yields state-of-the-art zero-shot video synthesis with high-fidelity motion and matching audio.
Significance. If the quantitative results hold, the work would demonstrate that standard LLM scaling and autoregressive pretraining can transfer effectively to high-quality video generation without heavy video-specific inductive biases. This would be a notable unification of multimodal generation under the decoder-only paradigm and could simplify future pipelines, though the absence of detailed baselines, model scale, and dataset descriptions in the provided text limits immediate assessment of impact.
major comments (2)
- [Abstract] The assertion of 'state-of-the-art capabilities' and 'high-fidelity motions' is presented without supporting quantitative evidence (e.g., FID, FVD, human preference scores, or direct comparisons to prior models such as Make-A-Video). This is load-bearing for the headline claim and must be addressed with tables or figures in the main text.
- [Abstract] The manuscript provides no information on model size (parameters), training data volume or composition, or exact autoregressive objectives used in pretraining. These details are required to evaluate whether the zero-shot transfer result is reproducible or merely consistent with scaling trends.
minor comments (1)
- [Abstract] The project page URL is given but no corresponding reference or citation is provided for the supplementary materials or videos.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address each major comment below and have revised the manuscript to strengthen the presentation of results and reproducibility details.
Point-by-point responses
- Referee: [Abstract] The assertion of 'state-of-the-art capabilities' and 'high-fidelity motions' is presented without supporting quantitative evidence (e.g., FID, FVD, human preference scores, or direct comparisons to prior models such as Make-A-Video). This is load-bearing for the headline claim and must be addressed with tables or figures in the main text.
Authors: We agree that the abstract claims require clear quantitative support in the main text. The full manuscript contains these evaluations in the experiments section, including FVD metrics, human preference scores, and direct comparisons against Make-A-Video and other baselines. To address the concern, we have added a summary table of key zero-shot results (FVD, human studies) to the main body immediately following the introduction, ensuring the SOTA claims are explicitly grounded. Revision: yes.
- Referee: [Abstract] The manuscript provides no information on model size (parameters), training data volume or composition, or exact autoregressive objectives used in pretraining. These details are required to evaluate whether the zero-shot transfer result is reproducible or merely consistent with scaling trends.
Authors: We acknowledge the importance of these details for assessing the work. The manuscript describes the decoder-only transformer architecture and the two-stage pretraining-plus-adaptation protocol with a mixture of autoregressive objectives, but we have expanded the methods section to explicitly report the model scale in parameters, the training data volume and composition (including sources for video, image, text, and audio), and the precise set of generative objectives (e.g., next-token prediction on tokenized multimodal sequences). These additions improve reproducibility without altering the core claims. Revision: yes.
Circularity Check
No significant circularity; the claims rest on empirical training results.
full rationale
The paper presents VideoPoet as an empirical LLM trained autoregressively on multimodal data (text, image, video, audio) using a standard decoder-only transformer. Claims of zero-shot video generation and high-fidelity motion rest on training outcomes and evaluations, not on any derivation chain, fitted parameter renamed as prediction, or self-citation that reduces the result to its inputs by construction. No equations or uniqueness theorems are invoked that loop back to the model's own outputs; the protocol follows established LLM scaling literature without circular reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Autoregressive modeling of multimodal sequences can capture video dynamics effectively.
Forward citations
Cited by 19 Pith papers
- CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models
  LiteLVLM prunes visual tokens for pixel grounding by reversing CLIP visual-text similarity to retain referent region tokens, outperforming prior methods by over 5% with 22% speedup and 2.3x memory reduction without an...
- MCSC-Bench: Multimodal Context-to-Script Creation for Realistic Video Production
  MCSC-Bench is the first large-scale dataset for the Multimodal Context-to-Script Creation task, requiring models to select relevant shots from redundant materials, plan missing shots, and generate coherent scripts wit...
- FrameDiT: Diffusion Transformer with Matrix Attention for Efficient Video Generation
  FrameDiT proposes Matrix Attention for DiTs to achieve SOTA video generation with improved temporal coherence and efficiency comparable to local factorized attention.
- Stream-T1: Test-Time Scaling for Streaming Video Generation
  Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve tempor...
- Animator-Centric Skeleton Generation on Objects with Fine-Grained Details
  An animator-centric skeleton generation method that uses semantic-aware tokenization and a learnable density interval module to produce controllable, high-quality skeletons on complex 3D meshes.
- DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer
  RTR-DiT distills a bidirectional DiT teacher into an autoregressive few-step model using Self Forcing and Distribution Matching Distillation, plus a reference-preserving KV cache, to enable stable real-time text- and ...
- Latent-Compressed Variational Autoencoder for Video Diffusion Models
  A frequency-based latent compression method for video VAEs yields higher reconstruction quality than channel-reduction baselines at fixed compression ratios.
- Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?
  Veo-3 video predictions enable approximate task-level robot trajectories in zero-shot settings but require hierarchical integration with low-level VLA policies for reliable manipulation performance.
- MMaDA: Multimodal Large Diffusion Language Models
  MMaDA is a unified multimodal diffusion model using mixed chain-of-thought fine-tuning and a new UniGRPO reinforcement learning algorithm that outperforms specialized models in reasoning, understanding, and text-to-im...
- MAGI-1: Autoregressive Video Generation at Scale
  MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.
- Emu3: Next-Token Prediction is All You Need
  Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.
- Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation
  Gen2Act enables generalizable robot manipulation for unseen objects and novel motions by using zero-shot human video generation from web data to condition a policy trained on an order of magnitude less robot interaction data.
- CameraCtrl: Enabling Camera Control for Text-to-Video Generation
  CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.
- Movie Gen: A Cast of Media Foundation Models
  A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.
- Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
  Show-o unifies autoregressive and discrete diffusion modeling inside one transformer to support multimodal understanding and generation tasks with competitive benchmark performance.
- Show-o2: Improved Native Unified Multimodal Models
  Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
- VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
  VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.
- Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
  The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.
- Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
  The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.
Reference graph
Works this paper leans on
- [1] MusicLM: Generating Music From Text
  Agostinelli, A., Denk, T. I., Borsos, Z., Engel, J., Verzetti, M., Caillon, A., Huang, Q., Jansen, A., Roberts, A., Tagliasacchi, M., et al. arXiv preprint arXiv:2301.11325.
- [2] Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception
  Akbari, H., Kondratyuk, D., Cui, Y., Hornung, R., Wang, H., and Adam, H. arXiv preprint arXiv:2305.06324.
- [3] PaLM 2 Technical Report
  Anil, R., Dai, A. M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., Shakeri, S., Taropa, E., Bailey, P., Chen, Z., et al. arXiv preprint arXiv:2305.10403.
- [4] Lumiere: A Space-Time Diffusion Model for Video Generation
  Bar-Tal, O., Chefer, H., Tov, O., Herrmann, C., Paiss, R., Zada, S., Ephrat, A., Hur, J., Li, Y., Michaeli, T., et al. arXiv preprint arXiv:2401.12945.
- [5] Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
  Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al. arXiv preprint arXiv:2311.15127.
- [6] Language Models are Few-Shot Learners
  Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. NeurIPS, 33:1877–1901.
- [7] A Short Note about Kinetics-600
  Carreira, J., Noland, E., Banki-Horvath, A., Hillier, C., and Zisserman, A. arXiv preprint arXiv:1808.01340.
- [8]
- [9] VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
  Chen, H., Xia, M., He, Y., Zhang, Y., Cun, X., Yang, S., Xing, J., Liu, Y., Chen, Q., Wang, X., et al. arXiv preprint arXiv:2310.19512.
- [10] PaLM: Scaling Language Modeling with Pathways
  Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. arXiv preprint arXiv:2204.02311.
- [11] PaLM-E: An Embodied Multimodal Language Model
  Driess, D., Xia, F., Sajjadi, M. S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al. arXiv preprint arXiv:2303.03378.
- [12] CCEdit: Creative and Controllable Video Editing via Diffusion Models
  Feng, R., Weng, W., Wang, Y., Yuan, Y., Bao, J., Luo, C., Chen, Z., and Guo, B. arXiv preprint arXiv:2309.16496.
- [13] TokenFlow: Consistent Diffusion Features for Consistent Video Editing
  Geyer, M., Bar-Tal, O., Bagon, S., and Dekel, T. arXiv preprint arXiv:2307.10373.
- [14] AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
  Guo, Y., Yang, C., Rao, A., Wang, Y., Qiao, Y., Lin, D., and Dai, B. arXiv preprint arXiv:2307.04725.
- [15] MaskViT: Masked Visual Pre-Training for Video Prediction
  Gupta, A., Tian, S., Zhang, Y., Wu, J., Martín-Martín, R., and Fei-Fei, L. arXiv preprint arXiv:2206.11894.
- [16] Photorealistic Video Generation with Diffusion Models
  Gupta, A., Yu, L., Sohn, K., Gu, X., Hahn, M., Fei-Fei, L., Essa, I., Jiang, L., and Lezama, J. arXiv preprint arXiv:2312.06662.
- [17] Latent Video Diffusion Models for High-Fidelity Long Video Generation
  He, Y., Yang, T., Zhang, Y., Shan, Y., and Chen, Q. arXiv preprint arXiv:2211.13221.
- [18] Classifier-Free Diffusion Guidance
  Ho, J. and Salimans, T. arXiv preprint arXiv:2207.12598.
- [19] Imagen Video: High Definition Video Generation with Diffusion Models
  Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D. P., Poole, B., Norouzi, M., Fleet, D. J., et al. arXiv preprint arXiv:2210.02303.
- [20] CogVideo: Large-Scale Pretraining for Text-to-Video Generation via Transformers
  Hong, W., Ding, M., Zheng, W., Liu, X., and Tang, J. arXiv preprint arXiv:2205.15868.
- [21] GAIA-1: A Generative World Model for Autonomous Driving
  Hu, A., Russell, L., Yeo, H., Murez, Z., Fedoseev, G., Kendall, A., Shotton, J., and Corrado, G. arXiv preprint arXiv:2309.17080.
- [22] StarCoder: May the Source Be with You!
  Li, R., Allal, L. B., Zi, Y., Muennighoff, N., Kocetkov, D., Mou, C., Marone, M., Akiki, C., Li, J., Chim, J., et al. arXiv preprint arXiv:2305.06161.
- [23] MagicEdit: High-Fidelity and Temporally Coherent Video Editing
  Liew, J. H., Yan, H., Zhang, J., Xu, Z., and Feng, J. arXiv preprint arXiv:2308.14749.
- [24] SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations
  Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.-Y., and Ermon, S. arXiv preprint arXiv:2108.01073.
- [25] Transframer: Arbitrary Frame Prediction with Generative Models
  Nash, C., Carreira, J., Walker, J., Barr, I., Jaegle, A., Malinowski, M., and Battaglia, P. arXiv preprint arXiv:2203.09494.
- [26] GPT-4 Technical Report
  OpenAI. arXiv preprint arXiv:2303.08774.
- [27] SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
  Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., and Rombach, R. arXiv preprint arXiv:2307.01952.
- [28] Zero-Shot Text-to-Image Generation
  Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., and Sutskever, I. arXiv preprint arXiv:2102.12092.
- [29] Hierarchical Text-Conditional Image Generation with CLIP Latents
  Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. arXiv preprint arXiv:2204.06125.
- [30] AudioPaLM: A Large Language Model That Can Speak and Listen
  Rubenstein, P. K., Asawaroengchai, C., Nguyen, D. D., Bapna, A., Borsos, Z., Quitry, F. d. C., Chen, P., Badawy, D. E., Han, W., Kharitonov, E., et al. arXiv preprint arXiv:2306.12925.
- [31] A Step Toward More Inclusive People Annotations for Fairness
  Schumann, C., Ricco, S., Prabhu, U., Ferrari, V., and Pantofaru, C. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pp. 916–925.
- [32] Make-A-Video: Text-to-Video Generation without Text-Video Data
  Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., et al. arXiv preprint arXiv:2209.14792.
- [33] UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
  Soomro, K., Zamir, A. R., and Shah, M. arXiv preprint arXiv:1212.0402.
- [34] Any-to-Any Generation via Composable Diffusion
  Tang, Z., Yang, Z., Zhu, C., Zeng, M., and Bansal, M. arXiv preprint arXiv:2305.11846.
- [35] Towards Accurate Generative Models of Video: A New Metric & Challenges
  Unterthiner, T., Van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., and Gelly, S. arXiv preprint arXiv:1812.01717.
- [36] Phenaki: Variable Length Video Generation from Open Domain Textual Description
  Villegas, R., Babaeizadeh, M., Kindermans, P.-J., Moraldo, H., Zhang, H., Saffar, M. T., Castro, S., Kunze, J., and Erhan, D. arXiv preprint arXiv:2210.02399.
- [37] ModelScope Text-to-Video Technical Report
  Wang, J., Yuan, H., Chen, D., Zhang, Y., Wang, X., and Zhang, S. arXiv preprint arXiv:2308.06571.
- [38] VideoGPT: Video Generation using VQ-VAE and Transformers
  Yan, W., Zhang, Y., Abbeel, P., and Srinivas, A. arXiv preprint arXiv:2104.10157.
- [39] Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
  Yu, J., Xu, Y., Koh, J. Y., Luong, T., Baid, G., Wang, Z., Vasudevan, V., Ku, A., Yang, Y., Ayan, B. K., et al. arXiv preprint arXiv:2206.10789.
- [40] Make Pixels Dance: High-Dynamic Video Generation
  Zeng, Y., Wei, G., Zheng, J., Zou, J., Wei, Y., Zhang, Y., and Li, H. arXiv preprint arXiv:2311.10982.
- [41] Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation
  Zhang, D. J., Wu, J. Z., Liu, J.-W., Zhao, R., Ran, L., Gu, Y., Gao, D., and Shou, M. Z. arXiv preprint arXiv:2309.15818.
- [42] MagicVideo: Efficient Video Generation With Latent Diffusion Models
  Zhou, D., Wang, W., Yan, H., Lv, W., Zhu, Y., and Feng, J. arXiv preprint arXiv:2211.11018.
discussion (0)