arXiv: 2104.10157 · v2 · submitted 2021-04-20 · cs.CV · cs.LG

VideoGPT: Video Generation using VQ-VAE and Transformers

Aravind Srinivas, Pieter Abbeel, Wilson Yan, Yunzhi Zhang

Pith reviewed 2026-05-13 17:20 UTC · model grok-4.3

keywords video generation · VQ-VAE · transformers · autoregressive modeling · discrete latents · BAIR dataset · UCF-101

The pith

A VQ-VAE followed by an autoregressive transformer generates video samples competitive with GANs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VideoGPT as a straightforward two-stage method for likelihood-based video generation. It first trains a VQ-VAE with 3D convolutions and axial self-attention to compress raw video frames into discrete latent codes, then fits a GPT-style transformer that predicts these codes autoregressively using explicit spatio-temporal position encodings. On the BAIR Robot Pushing dataset the resulting samples are competitive with state-of-the-art GAN baselines, while on UCF-101 and TGIF the model produces coherent natural videos without any adversarial training. A sympathetic reader would value the work because it replaces unstable GAN objectives with a simpler, more stable maximum-likelihood pipeline that still scales to realistic video dynamics.
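
As an illustration of what a spatio-temporal position means for the transformer, the sketch below maps a flat sequence index to (time, height, width) coordinates on the latent grid and back. It assumes raster-scan ordering and a 4 × 16 × 16 latent grid; both are illustrative choices rather than details taken from the paper, and the function names `unflatten` and `flatten` are hypothetical.

```python
def unflatten(index, shape):
    """Map a flat raster-scan index to (t, h, w) on a latent grid of the given shape."""
    t_size, h_size, w_size = shape
    if not 0 <= index < t_size * h_size * w_size:
        raise IndexError("index outside the latent grid")
    t, rem = divmod(index, h_size * w_size)  # which latent frame
    h, w = divmod(rem, w_size)               # row and column inside that frame
    return t, h, w


def flatten(coord, shape):
    """Inverse of unflatten: (t, h, w) back to the flat sequence index."""
    t, h, w = coord
    _, h_size, w_size = shape
    return (t * h_size + h) * w_size + w


shape = (4, 16, 16)            # assumed latent grid: 4 frames of 16 x 16 codes
print(unflatten(0, shape))     # (0, 0, 0)
print(unflatten(255, shape))   # (0, 15, 15): last code of the first latent frame
print(unflatten(256, shape))   # (1, 0, 0): first code of the second latent frame
```

A position encoding can then be built from the three coordinates, for example one learned embedding per axis; how the paper combines them is not stated in this summary.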

Core claim

VideoGPT shows that a VQ-VAE can learn downsampled discrete latent representations of natural videos through 3D convolutions and axial self-attention, after which a standard GPT-like transformer with spatio-temporal position encodings can autoregressively model those latents to produce samples competitive with state-of-the-art GANs on the BAIR dataset and high-fidelity videos on UCF-101 and TGIF.
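
The two stages can be written compactly. The notation below is an assumption introduced for illustration: x is a video, E and D are the VQ-VAE encoder and decoder, e is the nearest codebook vector to E(x), sg is the stop-gradient operator, β is a commitment weight, and z_1, ..., z_n are the flattened code indices. The first line is the standard VQ-VAE objective; the paper's exact variant may differ (for example, the codebook term can be replaced by exponential-moving-average updates).

```latex
% Stage 1: standard VQ-VAE objective (reconstruction + codebook + commitment terms)
\mathcal{L}_{\text{VQ-VAE}}
  = \lVert x - D(e) \rVert_2^2
  + \lVert \operatorname{sg}[E(x)] - e \rVert_2^2
  + \beta \, \lVert E(x) - \operatorname{sg}[e] \rVert_2^2

% Stage 2: autoregressive prior over the discrete codes, trained by maximum likelihood
p_\theta(z) = \prod_{i=1}^{n} p_\theta(z_i \mid z_{<i}),
\qquad
\mathcal{L}_{\text{prior}} = -\sum_{i=1}^{n} \log p_\theta(z_i \mid z_{<i})
```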

What carries the argument

VQ-VAE compression of video into discrete spatio-temporal codes followed by autoregressive next-code prediction with a GPT-style transformer.
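
A minimal sketch of the quantization step, assuming a tiny 2-dimensional codebook with four entries (real codebooks are far larger and higher-dimensional). The functions `quantize` and `dequantize` are hypothetical names, and the nearest-neighbour rule under squared Euclidean distance is the standard VQ-VAE choice rather than something confirmed by this summary.

```python
def quantize(vectors, codebook):
    """For each encoder output, return the index of the nearest codebook entry."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(v, codebook[k]))
        for v in vectors
    ]


def dequantize(codes, codebook):
    """Look the discrete codes back up as codebook vectors (the decoder's input)."""
    return [codebook[k] for k in codes]


# Toy codebook and encoder outputs, both made up for illustration.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
encoder_outputs = [(0.1, -0.2), (0.9, 0.8), (0.2, 0.7)]

codes = quantize(encoder_outputs, codebook)
print(codes)                        # [0, 3, 2]
print(dequantize(codes, codebook))  # [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
```

The transformer never sees the continuous vectors, only the integer codes, which is why the second stage can reuse a language-model-style next-token objective.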

If this is right

  • Video generation can be performed with maximum-likelihood training instead of adversarial objectives while remaining competitive on robot-pushing data.
  • The same architecture produces coherent human-action videos on UCF-101 and short natural clips on TGIF.
  • The two-stage discrete-latent approach offers a simpler training recipe than end-to-end GANs for video synthesis.
  • The model supplies a minimal, reproducible baseline for transformer-based video generation that others can extend.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the latent compression step generalizes to longer or higher-resolution videos, the autoregressive stage could be scaled without redesigning the entire pipeline.
  • Conditioning the transformer on additional inputs such as text or action labels would turn the same architecture into a controllable video generator.
  • The separation of compression and modeling stages may allow independent improvements to the VQ-VAE or the transformer without retraining the other component.

Load-bearing premise

The discrete codes produced by the VQ-VAE retain enough spatial and temporal detail that next-code prediction can recover high-fidelity video motion and appearance.

What would settle it

If, under the quantitative metrics the paper reports, VideoGPT samples scored markedly worse than the GAN baselines on BAIR, or showed clearly degraded frame consistency or motion realism on UCF-101 and TGIF, the central claim would be refuted.
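
One such metric is FVD (Fréchet Video Distance), which the referee report names for the BAIR experiments. In its usual definition it is the Fréchet distance between Gaussian fits to features of real and generated videos extracted by a pretrained video network. The symbols below are assumptions introduced for illustration: μ_r, Σ_r are the mean and covariance of real-video features and μ_g, Σ_g those of generated-video features. Lower values indicate closer distributions.

```latex
\mathrm{FVD}
  = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```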

Original abstract

We present VideoGPT: a conceptually simple architecture for scaling likelihood based generative modeling to natural videos. VideoGPT uses VQ-VAE that learns downsampled discrete latent representations of a raw video by employing 3D convolutions and axial self-attention. A simple GPT-like architecture is then used to autoregressively model the discrete latents using spatio-temporal position encodings. Despite the simplicity in formulation and ease of training, our architecture is able to generate samples competitive with state-of-the-art GAN models for video generation on the BAIR Robot dataset, and generate high fidelity natural videos from UCF-101 and Tumbler GIF Dataset (TGIF). We hope our proposed architecture serves as a reproducible reference for a minimalistic implementation of transformer based video generation models. Samples and code are available at https://wilson1yan.github.io/videogpt/index.html

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents VideoGPT, a two-stage model for video generation. It employs a VQ-VAE with 3D convolutions and axial self-attention to learn discrete spatio-temporal latent codes from raw video frames. These codes are then modeled autoregressively using a GPT-style transformer with spatio-temporal positional encodings. The authors report that the model generates samples competitive with state-of-the-art GANs on the BAIR Robot Pushing dataset (measured by FVD) and produces high-fidelity natural videos on UCF-101 and TGIF datasets.

Significance. This work offers a simple and reproducible likelihood-based alternative to adversarial methods for video generation. By releasing code and samples, it provides a useful baseline for future transformer-based video models. The approach demonstrates that discrete latents from VQ-VAE can capture sufficient information for high-quality autoregressive generation, which could influence hybrid VQ-transformer architectures in the field.

major comments (2)
  1. [§4.1] §4.1 (BAIR experiments): The FVD scores are reported but no table lists the exact numerical values for the cited baselines (e.g., MoCoGAN, Video Transformer); without these numbers the claim of being 'competitive with state-of-the-art GAN models' remains imprecise.
  2. [§4.2] §4.2 (UCF-101 and TGIF): Only qualitative samples are shown; the absence of any quantitative metric (FVD, IS, or LPIPS) on these datasets makes the 'high fidelity natural videos' claim difficult to verify against prior work.
minor comments (2)
  1. [§3.1] §3.1: The axial self-attention block inside the VQ-VAE encoder would be clearer if accompanied by a short equation or pseudocode showing the factorization along height/width/time axes.
  2. [Figure 1] Figure 1 caption: The latent code dimensions (e.g., downsampling factor and codebook size) are not stated, which affects immediate readability of the architecture diagram.
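
To make the first minor comment concrete, the sketch below counts query-key pairs for full self-attention versus attention factorized along the time, height, and width axes. The 4 × 16 × 16 grid is an assumed example, the function names are hypothetical, and the count ignores heads, channels, and constant factors; it is an illustration of the factorization idea, not the paper's implementation.

```python
def full_attention_pairs(t, h, w):
    """Every position attends to every position on the t x h x w grid."""
    n = t * h * w
    return n * n


def axial_attention_pairs(t, h, w):
    """Each position attends only along its own time, height, and width lines."""
    n = t * h * w
    return n * (t + h + w)


t, h, w = 4, 16, 16  # assumed grid size
full = full_attention_pairs(t, h, w)
axial = axial_attention_pairs(t, h, w)
print(full)                     # 1048576
print(axial)                    # 36864
print(round(full / axial, 1))   # 28.4
```

The gap widens as the grid grows, since the full count scales with the square of the number of positions while the axial count scales with the number of positions times the sum of the axis lengths.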

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. We address the two major comments point-by-point below and will revise the manuscript accordingly.

point-by-point responses
  1. Referee: [§4.1] §4.1 (BAIR experiments): The FVD scores are reported but no table lists the exact numerical values for the cited baselines (e.g., MoCoGAN, Video Transformer); without these numbers the claim of being 'competitive with state-of-the-art GAN models' remains imprecise.

    Authors: We agree that a direct tabular comparison would make the competitiveness claim more precise. The manuscript reports our FVD score but does not tabulate the exact baseline numbers cited in the text. In the revised version we will add a table in §4.1 that lists FVD values for VideoGPT together with the reported scores from MoCoGAN, Video Transformer, and the other baselines referenced in the section. revision: yes

  2. Referee: [§4.2] §4.2 (UCF-101 and TGIF): Only qualitative samples are shown; the absence of any quantitative metric (FVD, IS, or LPIPS) on these datasets makes the 'high fidelity natural videos' claim difficult to verify against prior work.

    Authors: We acknowledge that quantitative metrics would strengthen verifiability. The original submission emphasized qualitative results on UCF-101 and TGIF because of their high diversity and the computational cost of large-scale evaluation. In the revised manuscript we will compute and report FVD scores on these datasets (using the same protocol as BAIR) so that the high-fidelity claim can be directly compared with prior work. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper presents a standard two-stage architecture: a VQ-VAE (with 3D convolutions and axial attention) that produces discrete spatio-temporal latents from raw video, followed by a GPT-style autoregressive transformer that models those latents using position encodings. All training uses external datasets (BAIR, UCF-101, TGIF) with standard likelihood objectives; evaluation occurs on held-out test sets via FVD and qualitative inspection. No equation or claim reduces by construction to a fitted parameter renamed as a prediction, no self-definitional loop exists between components, and no load-bearing uniqueness theorem or ansatz is imported via self-citation. The derivation chain is fully external to the paper's own outputs and remains self-contained against independent benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the effectiveness of VQ-VAE for video compression and standard transformer autoregression; no new entities are postulated.

free parameters (2)
  • VQ codebook size
    Discrete vocabulary size in VQ-VAE chosen to balance reconstruction quality and modeling difficulty.
  • latent downsampling factor
    Spatio-temporal compression ratio selected to keep the transformer sequence length tractable.
axioms (2)
  • domain assumption: VQ-VAE can learn compact discrete representations that preserve video dynamics
    Invoked when claiming the latents are sufficient for high-fidelity generation.
  • domain assumption: Autoregressive modeling on discrete codes captures long-range spatio-temporal dependencies
    Underlying the use of GPT-style prediction for video.
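
The interaction between the two free parameters can be sketched with simple arithmetic. The video size (16 frames of 64 × 64 RGB), the 4× downsampling per axis, and the codebook size of 1024 are assumed example values, not settings confirmed by this summary; `latent_shape` and `code_bits` are hypothetical helper names.

```python
import math


def latent_shape(video_shape, downsample):
    """Latent grid size after dividing each (t, h, w) axis by its downsampling factor."""
    if any(v % d for v, d in zip(video_shape, downsample)):
        raise ValueError("each axis must be divisible by its downsampling factor")
    return tuple(v // d for v, d in zip(video_shape, downsample))


def code_bits(shape, codebook_size):
    """Bits needed to store one index per latent position."""
    t, h, w = shape
    return t * h * w * math.log2(codebook_size)


video = (16, 64, 64)                      # assumed: frames, height, width
latents = latent_shape(video, (4, 4, 4))  # assumed 4x downsampling on every axis
seq_len = latents[0] * latents[1] * latents[2]
raw_bits = 16 * 64 * 64 * 3 * 8           # 8-bit RGB pixels

print(latents)                               # (4, 16, 16)
print(seq_len)                               # 1024 tokens for the transformer
print(code_bits(latents, 1024))              # 10240.0
print(raw_bits / code_bits(latents, 1024))   # 153.6
```

A smaller downsampling factor preserves more detail but lengthens the sequence the transformer must model, while a larger codebook raises the bits per code without changing the sequence length.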

Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. E2E-WAVE: End-to-End Learned Waveform Generation for Underwater Video Multicasting

    eess.SP 2026-04 unverdicted novelty 7.0

    E2E-WAVE achieves +5 dB PSNR and real-time 16 FPS 128x128 video over 2.3 kbps underwater channels by learning waveforms that favor semantic similarity on decoding errors.

  2. Unsupervised Skeleton-Based Action Segmentation via Hierarchical Spatiotemporal Vector Quantization

    cs.CV 2026-04 unverdicted novelty 7.0

    A hierarchical spatiotemporal vector quantization framework segments skeleton-based actions without supervision, achieving new state-of-the-art results on HuGaDB, LARa, and BABEL while reducing segment length bias.

  3. HumANDiff: Articulated Noise Diffusion for Motion-Consistent Human Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    HumANDiff improves motion consistency in human video generation by sampling diffusion noise on an articulated human body template and adding joint appearance-motion prediction plus a geometric consistency loss.

  4. Video Diffusion Models

    cs.CV 2022-04 unverdicted novelty 7.0

    A diffusion model for video generation extends image architectures with joint image-video training and improved conditional sampling, delivering first large-scale text-to-video results and state-of-the-art performance...

  5. High-Resolution Image Synthesis with Latent Diffusion Models

    cs.CV 2021-12 conditional novelty 7.0

    Latent diffusion models achieve state-of-the-art inpainting and competitive results on unconditional generation, scene synthesis, and super-resolution by performing the diffusion process in the latent space of pretrai...

  6. Network-Efficient World Model Token Streaming

    cs.RO 2026-05 unverdicted novelty 6.0

    An adaptive delta-prioritization algorithm using cosine distance and Hamming-drift thresholds improves embedding distortion by 4.8-7.2% and next-token perplexity by 2.1-6.3% over periodic keyframing at matched low bit...

  7. CASCADE: Context-Aware Relaxation for Speculative Image Decoding

    cs.CV 2026-05 unverdicted novelty 6.0

    CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to...

  8. Stream-T1: Test-Time Scaling for Streaming Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve tempor...

  9. A Hybridizable Neural Time Integrator for Stable Autoregressive Forecasting

    cs.LG 2026-04 unverdicted novelty 6.0

    A hybrid transformer-FEM integrator provides provable discrete energy preservation and gradient bounds for stable autoregressive forecasting of chaotic systems, with 65x fewer parameters and 9000x speedup in a fusion ...

  10. Animator-Centric Skeleton Generation on Objects with Fine-Grained Details

    cs.GR 2026-04 unverdicted novelty 6.0

    An animator-centric skeleton generation method that uses semantic-aware tokenization and a learnable density interval module to produce controllable, high-quality skeletons on complex 3D meshes.

  11. Generative Refinement Networks for Visual Synthesis

    cs.CV 2026-04 unverdicted novelty 6.0

    GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.

  12. INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling

    cs.CV 2026-04 unverdicted novelty 6.0

    INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via STAR architecture with implicit caching, explicit geometric constraints, and distribution-matching...

  13. Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms

    eess.IV 2026-03 unverdicted novelty 6.0

    Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.

  14. Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    cs.CV 2025-06 unverdicted novelty 6.0

    Self Forcing trains autoregressive video diffusion models by performing autoregressive rollout with KV caching during training to close the exposure bias gap, using a holistic video-level loss and few-step diffusion f...

  15. Unified Video Action Model

    cs.RO 2025-02 unverdicted novelty 6.0

    UVA learns a joint video-action latent representation with decoupled diffusion decoding heads, enabling a single model to perform accurate fast policy learning, forward/inverse dynamics, and video generation without p...

  16. Emu3: Next-Token Prediction is All You Need

    cs.CV 2024-09 unverdicted novelty 6.0

    Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.

  17. CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    cs.CV 2024-08 unverdicted novelty 6.0

    CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.

  18. Latte: Latent Diffusion Transformer for Video Generation

    cs.CV 2024-01 unverdicted novelty 6.0

    Latte achieves state-of-the-art video generation on FaceForensics, SkyTimelapse, UCF101, and Taichi-HD by using a latent diffusion transformer with four efficient spatial-temporal decomposition variants and best-pract...

  19. Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation

    cs.RO 2023-12 conditional novelty 6.0

    A GPT-style model pre-trained on large video datasets achieves 94.9% success on CALVIN multi-task manipulation and 85.4% zero-shot generalization, outperforming prior baselines.

  20. Movie Gen: A Cast of Media Foundation Models

    cs.CV 2024-10 unverdicted novelty 5.0

    A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.

  21. CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

    cs.CV 2022-05 unverdicted novelty 5.0

    CogVideo is a large-scale transformer pretrained for text-to-video generation that outperforms public models in evaluations.

  22. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

  23. High-Fidelity Full-Sky Video Prediction for Photovoltaic Ramp Event Forecasting

    eess.SY 2026-05 unverdicted novelty 4.0

    PhyDiffNet and RaPVFormer combine sky video prediction with ramp-aware power forecasting to achieve state-of-the-art PV ramp detection with a 10% CSI gain.

  24. Cosmos World Foundation Model Platform for Physical AI

    cs.CV 2025-01 unverdicted novelty 3.0

    The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.
