arxiv: 2104.10157 · v2 · submitted 2021-04-20 · 💻 cs.CV · cs.LG

VideoGPT: Video Generation using VQ-VAE and Transformers

Aravind Srinivas, Pieter Abbeel, Wilson Yan, Yunzhi Zhang

Pith reviewed 2026-05-13 17:20 UTC · model grok-4.3

keywords video generation · VQ-VAE · transformers · autoregressive modeling · discrete latents · BAIR dataset · UCF-101

The pith

A VQ-VAE followed by an autoregressive transformer generates video samples competitive with GANs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VideoGPT as a straightforward two-stage method for likelihood-based video generation. It first trains a VQ-VAE with 3D convolutions and axial self-attention to produce compressed discrete latent codes from raw video frames, then fits a GPT-style transformer that predicts these codes autoregressively using explicit spatio-temporal position encodings. On the BAIR Robot Pushing dataset the resulting samples are competitive with state-of-the-art GAN baselines, while on UCF-101 and TGIF the model produces coherent natural videos without any adversarial training. A sympathetic reader would value the work because it replaces unstable GAN objectives with a simpler, more stable maximum-likelihood pipeline that still scales to realistic video dynamics.
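The second stage described above can be illustrated with a minimal sampling loop. This is only a sketch under stated assumptions: `sample_codes` and `toy_prior` are hypothetical names, and `toy_prior` is a hand-written stand-in for the trained transformer, not the paper's model. In the real pipeline the sampled codes would then be passed through the VQ-VAE decoder to obtain pixels.

```python
import random

def sample_codes(next_code_probs, seq_len, rng):
    """Draw a sequence of discrete codes one position at a time."""
    codes = []
    for _ in range(seq_len):
        # Distribution over the codebook, conditioned on the codes drawn so far.
        probs = next_code_probs(codes)
        codes.append(rng.choices(range(len(probs)), weights=probs, k=1)[0])
    return codes

def toy_prior(prefix, vocab_size=4):
    """Hypothetical stand-in for the transformer: favours repeating the last code."""
    if not prefix:
        return [1.0 / vocab_size] * vocab_size
    probs = [0.1] * vocab_size
    probs[prefix[-1]] = 0.7
    return probs

rng = random.Random(0)
codes = sample_codes(toy_prior, seq_len=8, rng=rng)
print(codes)  # eight integers in the range 0..3
```

Each new code depends only on the codes before it, which is what makes the model autoregressive and lets it be trained by plain maximum likelihood.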

Core claim

VideoGPT shows that a VQ-VAE can learn downsampled discrete latent representations of natural videos through 3D convolutions and axial self-attention, after which a standard GPT-like transformer with spatio-temporal position encodings can autoregressively model those latents to produce samples competitive with state-of-the-art GANs on the BAIR dataset and high-fidelity videos on UCF-101 and TGIF.
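To make "spatio-temporal position encodings" concrete, the sketch below maps a flat token index to (time, row, column) coordinates and builds an encoding by summing one embedding per axis. Summing per-axis embeddings is an assumption chosen for illustration; this summary does not say how the paper combines them, and the tables here are hand-written rather than learned.

```python
def flat_to_thw(index, shape):
    """Convert a raster-order index into (t, h, w) coordinates."""
    _, h_len, w_len = shape
    t, rem = divmod(index, h_len * w_len)
    h, w = divmod(rem, w_len)
    return t, h, w

def thw_to_flat(t, h, w, shape):
    """Inverse of flat_to_thw."""
    _, h_len, w_len = shape
    return (t * h_len + h) * w_len + w

def position_encoding(index, shape, tables):
    """Sum one embedding per axis (illustrative choice, not the paper's stated scheme)."""
    t, h, w = flat_to_thw(index, shape)
    t_tab, h_tab, w_tab = tables
    return [a + b + c for a, b, c in zip(t_tab[t], h_tab[h], w_tab[w])]

shape = (2, 3, 4)  # hypothetical latent grid: 2 time steps, 3 rows, 4 columns
tables = (
    [[0.0, 0.0], [1.0, 0.0]],
    [[0.0, 0.0], [0.0, 1.0], [0.0, 2.0]],
    [[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [0.5, 0.5]],
)
last = position_encoding(23, shape, tables)  # index 23 is (t=1, h=2, w=3)
```

The point of the encoding is that the transformer sees a flat sequence, so the position signal is its only way to know which tokens are neighbours in space or time.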

What carries the argument

VQ-VAE compression of video into discrete spatio-temporal codes followed by autoregressive next-code prediction with a GPT-style transformer.
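The compression step rests on vector quantization: each encoder output vector is replaced by the index of its nearest codebook entry. The following is a minimal sketch of that lookup only, with a hand-written toy codebook and made-up encoder outputs; it omits the encoder, decoder, and the training losses.

```python
def quantize(vector, codebook):
    """Return (index, code) for the codebook entry nearest to `vector` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    index = min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))
    return index, codebook[index]

# Toy values for illustration; a real codebook is learned and much larger.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
encoder_outputs = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]]
codes = [quantize(v, codebook)[0] for v in encoder_outputs]
print(codes)  # [0, 3, 2]
```

The integer indices are what the transformer models; the decoder only ever sees the corresponding codebook vectors.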

If this is right

Video generation can be performed with maximum-likelihood training instead of adversarial objectives while remaining competitive on robot-pushing data.
The same architecture produces coherent human-action videos on UCF-101 and short natural clips on TGIF.
The two-stage discrete-latent approach offers a simpler training recipe than end-to-end GANs for video synthesis.
The model supplies a minimal, reproducible baseline for transformer-based video generation that others can extend.
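The maximum-likelihood training mentioned in the first point can be written out explicitly. The notation is an assumption introduced here: $z = (z_1, \dots, z_N)$ are the discrete codes of one video in raster order, $N = T' H' W'$ is the size of the latent grid, and $\theta$ denotes the transformer parameters.

```latex
p_\theta(z) = \prod_{i=1}^{N} p_\theta\!\left(z_i \mid z_{<i}\right),
\qquad
\mathcal{L}(\theta) = -\,\mathbb{E}_{z}\!\left[\sum_{i=1}^{N} \log p_\theta\!\left(z_i \mid z_{<i}\right)\right]
```

Minimizing $\mathcal{L}$ is ordinary cross-entropy training on next-code prediction, with no discriminator involved.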

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the latent compression step generalizes to longer or higher-resolution videos, the autoregressive stage could be scaled without redesigning the entire pipeline.
Conditioning the transformer on additional inputs such as text or action labels would turn the same architecture into a controllable video generator.
The separation of compression and modeling stages may allow independent improvements to the VQ-VAE or the transformer without retraining the other component.

Load-bearing premise

The discrete codes produced by the VQ-VAE retain enough spatial and temporal detail that next-code prediction can recover high-fidelity video motion and appearance.

What would settle it

If videos generated by VideoGPT display markedly worse frame consistency or motion realism than real sequences when measured with the same quantitative metrics reported for BAIR, UCF-101, and TGIF, the central claim would be refuted.

Original abstract

We present VideoGPT: a conceptually simple architecture for scaling likelihood based generative modeling to natural videos. VideoGPT uses VQ-VAE that learns downsampled discrete latent representations of a raw video by employing 3D convolutions and axial self-attention. A simple GPT-like architecture is then used to autoregressively model the discrete latents using spatio-temporal position encodings. Despite the simplicity in formulation and ease of training, our architecture is able to generate samples competitive with state-of-the-art GAN models for video generation on the BAIR Robot dataset, and generate high fidelity natural videos from UCF-101 and Tumbler GIF Dataset (TGIF). We hope our proposed architecture serves as a reproducible reference for a minimalistic implementation of transformer based video generation models. Samples and code are available at https://wilson1yan.github.io/videogpt/index.html

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VideoGPT delivers a simple likelihood-based video generation method competitive with GANs on BAIR.

The key takeaway: this paper presents a straightforward VQ-VAE-plus-transformer architecture for video generation that is competitive with GANs on the BAIR dataset using standard likelihood training. The authors discretize videos with a VQ-VAE built from 3D convolutions and axial self-attention, then model the resulting discrete latents autoregressively with a GPT-style transformer that uses spatio-temporal position encodings. Results include FVD scores on BAIR that are competitive with prior GAN work, plus qualitative high-fidelity samples on UCF-101 and TGIF.

The paper does well by staying simple and stable. Likelihood training sidesteps GAN instability, the method is transparent, and the authors release code and samples for easy reproduction. This makes it a practical reference for others.

Soft spots are minor and not load-bearing. Quantitative metrics focus mainly on BAIR, with the other datasets leaning on visuals. A few hyperparameter details, such as the exact codebook size, could be spelled out more clearly in the text, though the code fills most gaps. The assumption that discrete latents preserve enough dynamics holds up in the samples but receives little further analysis.

This paper suits researchers looking for a stable baseline in video generation or in transformer applications to high-dimensional data. It is honest about its scope as a minimal implementation rather than a new frontier. I would bring it to a reading group to walk through the architecture choices. It deserves peer review because the experiments use public data, the approach is reproducible, and it provides a credible likelihood-based alternative that further work can build on.

Referee Report

2 major / 2 minor

Summary. The manuscript presents VideoGPT, a two-stage model for video generation. It employs a VQ-VAE with 3D convolutions and axial self-attention to learn discrete spatio-temporal latent codes from raw video frames. These codes are then modeled autoregressively using a GPT-style transformer with spatio-temporal positional encodings. The authors report that the model generates samples competitive with state-of-the-art GANs on the BAIR Robot Pushing dataset (measured by FVD) and produces high-fidelity natural videos on UCF-101 and TGIF datasets.

Significance. This work offers a simple and reproducible likelihood-based alternative to adversarial methods for video generation. By releasing code and samples, it provides a useful baseline for future transformer-based video models. The approach demonstrates that discrete latents from VQ-VAE can capture sufficient information for high-quality autoregressive generation, which could influence hybrid VQ-transformer architectures in the field.

major comments (2)

§4.1 (BAIR experiments): The FVD scores are reported but no table lists the exact numerical values for the cited GAN baselines (e.g., MoCoGAN, Video Transformer); without these numbers the claim of being 'competitive with state-of-the-art GAN models' remains imprecise.
§4.2 (UCF-101 and TGIF): Only qualitative samples are shown; the absence of any quantitative metric (FVD, IS, or LPIPS) on these datasets makes the 'high fidelity natural videos' claim difficult to verify against prior work.
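For readers unfamiliar with the metric named in both comments, FVD is a Fréchet distance between two Gaussians fitted to features that a pretrained video network extracts from real and generated clips. The symbols below are assumptions introduced here: $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the feature mean and covariance for real and generated videos.

```latex
d^2\big((\mu_r, \Sigma_r), (\mu_g, \Sigma_g)\big)
= \lVert \mu_r - \mu_g \rVert_2^2
+ \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right)
```

Lower is better, and the value depends on the feature extractor and the number of clips used, which is why the referee asks for baseline numbers computed under the same protocol.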

minor comments (2)

§3.1: The axial self-attention block inside the VQ-VAE encoder would be clearer if accompanied by a short equation or pseudocode showing the factorization along height/width/time axes.
Figure 1 caption: The latent code dimensions (e.g., downsampling factor and codebook size) are not stated, which affects immediate readability of the architecture diagram.
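The factorization requested in the §3.1 comment can be sketched as follows. This is an illustrative reading of axial attention in general, not the paper's block: it uses a single head, omits the learned query/key/value projections, and all function names are hypothetical. Attention is applied along one axis at a time, so each position only attends to the positions that share its other two coordinates.

```python
import math
from itertools import product

def attend(seq):
    """Softmax self-attention over a list of vectors (queries = keys = values)."""
    dim = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in seq]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, seq)) / total
                    for j in range(dim)])
    return out

def axial_attention(grid, axis):
    """Attend along one axis (0=time, 1=height, 2=width) of a [T][H][W][D] grid."""
    shape = (len(grid), len(grid[0]), len(grid[0][0]))
    result = [[[None] * shape[2] for _ in range(shape[1])] for _ in range(shape[0])]
    others = [a for a in range(3) if a != axis]
    for fixed in product(*(range(shape[a]) for a in others)):
        coords = []
        for i in range(shape[axis]):
            c = [0, 0, 0]
            c[axis] = i
            c[others[0]], c[others[1]] = fixed
            coords.append(tuple(c))
        line = attend([grid[t][h][w] for t, h, w in coords])
        for (t, h, w), vec in zip(coords, line):
            result[t][h][w] = vec
    return result

def score_counts(shape):
    """Attention scores needed: full attention vs. one pass per axis."""
    t, h, w = shape
    n = t * h * w
    return n * n, n * (t + h + w)

grid = [[[[float(t), float(h + w)] for w in range(3)] for h in range(2)]
        for t in range(2)]
along_time = axial_attention(grid, axis=0)
full, axial = score_counts((4, 16, 16))  # hypothetical latent grid
print(full, axial)  # 1048576 36864
```

The cost comparison is the design motivation: for a grid of N positions, full attention scores N² pairs, while three axial passes score N·(T+H+W).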

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. We address the two major comments point-by-point below and will revise the manuscript accordingly.

Referee: §4.1 (BAIR experiments): The FVD scores are reported but no table lists the exact numerical values for the cited GAN baselines (e.g., MoCoGAN, Video Transformer); without these numbers the claim of being 'competitive with state-of-the-art GAN models' remains imprecise.

Authors: We agree that a direct tabular comparison would make the competitiveness claim more precise. The manuscript reports our FVD score but does not tabulate the exact baseline numbers cited in the text. In the revised version we will add a table in §4.1 that lists FVD values for VideoGPT together with the reported scores from MoCoGAN, Video Transformer, and the other GAN baselines referenced in the section.

revision: yes

Referee: §4.2 (UCF-101 and TGIF): Only qualitative samples are shown; the absence of any quantitative metric (FVD, IS, or LPIPS) on these datasets makes the 'high fidelity natural videos' claim difficult to verify against prior work.

Authors: We acknowledge that quantitative metrics would strengthen verifiability. The original submission emphasized qualitative results on UCF-101 and TGIF because of their high diversity and the computational cost of large-scale evaluation. In the revised manuscript we will compute and report FVD scores on these datasets (using the same protocol as BAIR) so that the high-fidelity claim can be directly compared with prior work.

revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper presents a standard two-stage architecture: a VQ-VAE (with 3D convolutions and axial attention) that produces discrete spatio-temporal latents from raw video, followed by a GPT-style autoregressive transformer that models those latents using position encodings. All training uses external datasets (BAIR, UCF-101, TGIF) with standard likelihood objectives; evaluation occurs on held-out test sets via FVD and qualitative inspection. No equation or claim reduces by construction to a fitted parameter renamed as a prediction, no self-definitional loop exists between components, and no load-bearing uniqueness theorem or ansatz is imported via self-citation. The derivation chain is fully external to the paper's own outputs and remains self-contained against independent benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the effectiveness of VQ-VAE for video compression and standard transformer autoregression; no new entities are postulated.

free parameters (2)

VQ codebook size
Discrete vocabulary size in VQ-VAE chosen to balance reconstruction quality and modeling difficulty.
latent downsampling factor
Spatial-temporal compression ratio selected to make transformer sequence length tractable.
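The trade-off behind these two free parameters is simple arithmetic, sketched below. All numbers are hypothetical values chosen for illustration and are not taken from the paper: a 16×64×64 clip, a downsampling factor of 4 along each axis, and a codebook of 1024 entries.

```python
import math

def latent_shape(video_shape, downsample):
    """Size of the latent grid after downsampling each axis (T, H, W)."""
    return tuple(size // factor for size, factor in zip(video_shape, downsample))

def sequence_length(shape):
    """Number of tokens the transformer must model for one clip."""
    t, h, w = shape
    return t * h * w

video = (16, 64, 64)      # hypothetical clip: frames, height, width
factors = (4, 4, 4)       # hypothetical downsampling per axis
codebook_size = 1024      # hypothetical vocabulary size

latent = latent_shape(video, factors)
tokens = sequence_length(latent)
bits_per_code = math.log2(codebook_size)

print(latent, tokens, bits_per_code)  # (4, 16, 16) 1024 10.0
```

Halving any downsampling factor doubles the token count, and full self-attention cost grows with the square of that count, which is why the compression ratio largely determines whether the second stage is tractable.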

axioms (2)

domain assumption: VQ-VAE can learn compact discrete representations that preserve video dynamics
Invoked when claiming the latents are sufficient for high-fidelity generation.
domain assumption: Autoregressive modeling on discrete codes captures long-range spatio-temporal dependencies
Underlying the use of GPT-style prediction for video.

pith-pipeline@v0.9.0 · 5449 in / 1298 out tokens · 42026 ms · 2026-05-13T17:20:04.190779+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DAlembert.Inevitability · bilinear_family_forced · unclear
Despite the simplicity in formulation and ease of training, our architecture is able to generate samples competitive with state-of-the-art GAN models for video generation on the BAIR Robot dataset

Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

E2E-WAVE: End-to-End Learned Waveform Generation for Underwater Video Multicasting
eess.SP 2026-04 unverdicted novelty 7.0
E2E-WAVE achieves +5 dB PSNR and real-time 16 FPS 128x128 video over 2.3 kbps underwater channels by learning waveforms that favor semantic similarity on decoding errors.

Unsupervised Skeleton-Based Action Segmentation via Hierarchical Spatiotemporal Vector Quantization
cs.CV 2026-04 unverdicted novelty 7.0
A hierarchical spatiotemporal vector quantization framework segments skeleton-based actions without supervision, achieving new state-of-the-art results on HuGaDB, LARa, and BABEL while reducing segment length bias.

HumANDiff: Articulated Noise Diffusion for Motion-Consistent Human Video Generation
cs.CV 2026-04 unverdicted novelty 7.0
HumANDiff improves motion consistency in human video generation by sampling diffusion noise on an articulated human body template and adding joint appearance-motion prediction plus a geometric consistency loss.

Video Diffusion Models
cs.CV 2022-04 unverdicted novelty 7.0
A diffusion model for video generation extends image architectures with joint image-video training and improved conditional sampling, delivering first large-scale text-to-video results and state-of-the-art performance...

High-Resolution Image Synthesis with Latent Diffusion Models
cs.CV 2021-12 conditional novelty 7.0
Latent diffusion models achieve state-of-the-art inpainting and competitive results on unconditional generation, scene synthesis, and super-resolution by performing the diffusion process in the latent space of pretrai...

Network-Efficient World Model Token Streaming
cs.RO 2026-05 unverdicted novelty 6.0
An adaptive delta-prioritization algorithm using cosine distance and Hamming-drift thresholds improves embedding distortion by 4.8-7.2% and next-token perplexity by 2.1-6.3% over periodic keyframing at matched low bit...

CASCADE: Context-Aware Relaxation for Speculative Image Decoding
cs.CV 2026-05 unverdicted novelty 6.0
CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to...

Stream-T1: Test-Time Scaling for Streaming Video Generation
cs.CV 2026-05 unverdicted novelty 6.0
Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve tempor...

A Hybridizable Neural Time Integrator for Stable Autoregressive Forecasting
cs.LG 2026-04 unverdicted novelty 6.0
A hybrid transformer-FEM integrator provides provable discrete energy preservation and gradient bounds for stable autoregressive forecasting of chaotic systems, with 65x fewer parameters and 9000x speedup in a fusion ...

Animator-Centric Skeleton Generation on Objects with Fine-Grained Details
cs.GR 2026-04 unverdicted novelty 6.0
An animator-centric skeleton generation method that uses semantic-aware tokenization and a learnable density interval module to produce controllable, high-quality skeletons on complex 3D meshes.

Generative Refinement Networks for Visual Synthesis
cs.CV 2026-04 unverdicted novelty 6.0
GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.

INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling
cs.CV 2026-04 unverdicted novelty 6.0
INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via STAR architecture with implicit caching, explicit geometric constraints, and distribution-matching...

Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms
eess.IV 2026-03 unverdicted novelty 6.0
Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
cs.CV 2025-06 unverdicted novelty 6.0
Self Forcing trains autoregressive video diffusion models by performing autoregressive rollout with KV caching during training to close the exposure bias gap, using a holistic video-level loss and few-step diffusion f...

Unified Video Action Model
cs.RO 2025-02 unverdicted novelty 6.0
UVA learns a joint video-action latent representation with decoupled diffusion decoding heads, enabling a single model to perform accurate fast policy learning, forward/inverse dynamics, and video generation without p...

Emu3: Next-Token Prediction is All You Need
cs.CV 2024-09 unverdicted novelty 6.0
Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
cs.CV 2024-08 unverdicted novelty 6.0
CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.

Latte: Latent Diffusion Transformer for Video Generation
cs.CV 2024-01 unverdicted novelty 6.0
Latte achieves state-of-the-art video generation on FaceForensics, SkyTimelapse, UCF101, and Taichi-HD by using a latent diffusion transformer with four efficient spatial-temporal decomposition variants and best-pract...

Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation
cs.RO 2023-12 conditional novelty 6.0
A GPT-style model pre-trained on large video datasets achieves 94.9% success on CALVIN multi-task manipulation and 85.4% zero-shot generalization, outperforming prior baselines.

Movie Gen: A Cast of Media Foundation Models
cs.CV 2024-10 unverdicted novelty 5.0
A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
cs.CV 2022-05 unverdicted novelty 5.0
CogVideo is a large-scale transformer pretrained for text-to-video generation that outperforms public models in evaluations.

World Action Models: The Next Frontier in Embodied AI
cs.RO 2026-05 unverdicted novelty 4.0
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

High-Fidelity Full-Sky Video Prediction for Photovoltaic Ramp Event Forecasting
eess.SY 2026-05 unverdicted novelty 4.0
PhyDiffNet and RaPVFormer combine sky video prediction with ramp-aware power forecasting to achieve state-of-the-art PV ramp detection with a 10% CSI gain.

Cosmos World Foundation Model Platform for Physical AI
cs.CV 2025-01 unverdicted novelty 3.0
The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · cited by 24 Pith papers · 14 internal anchors
