pith. machine review for the scientific record.

arxiv: 2310.05737 · v3 · submitted 2023-10-09 · 💻 cs.CV · cs.AI · cs.MM

Recognition: 2 theorem links · Lean Theorem

Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 20:01 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.MM
keywords MAGVIT-v2 · visual tokenizer · language models · diffusion models · image generation · video generation · video compression · action recognition

The pith

A new tokenizer lets language models outperform diffusion models on image and video generation benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MAGVIT-v2, a tokenizer that converts images and videos into concise discrete tokens sharing one vocabulary. When paired with large language models, this tokenizer produces higher-quality outputs than diffusion models on ImageNet for images and Kinetics for videos. The same tokenizer also achieves video compression quality comparable to next-generation codecs by human judgment and yields stronger features for action recognition. A reader would care because the result frames tokenization, rather than model family, as the decisive factor for applying language models to visual generation.

Core claim

Equipped with the MAGVIT-v2 tokenizer, large language models outperform diffusion models on standard image and video generation benchmarks including ImageNet and Kinetics. The tokenizer generates concise and expressive tokens for both videos and images using a common token vocabulary and also surpasses prior video tokenizers on compression and representation learning for action recognition.

What carries the argument

MAGVIT-v2, a tokenizer that maps pixel inputs to discrete tokens using a shared vocabulary for images and videos.
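
The abstract does not spell out the quantization mechanism, so purely as an illustration, the sketch below shows one way a shared discrete vocabulary for images and videos could be realized: a lookup-free, sign-based quantizer over encoder latents, with an image treated as a one-frame video. The latent width, the shapes, and every name here are assumptions for the example, not the paper's implementation.

    import numpy as np

    LATENT_DIM = 10               # assumed latent width; implies a 2**10-token vocabulary
    VOCAB_SIZE = 2 ** LATENT_DIM  # one shared vocabulary for images and videos

    def quantize(latents: np.ndarray) -> np.ndarray:
        """Sign-based, lookup-free quantization of encoder latents.

        latents: float array of shape (..., LATENT_DIM), e.g. the output of a
        spatiotemporal encoder in which an image is a one-frame video.
        Returns integer token ids in [0, VOCAB_SIZE).
        """
        bits = (latents > 0).astype(np.int64)   # one bit per latent channel
        weights = 2 ** np.arange(LATENT_DIM)    # binary code -> integer id
        return bits @ weights

    def dequantize(tokens: np.ndarray) -> np.ndarray:
        """Map token ids back to the +/-1 codes a decoder would consume."""
        bits = (tokens[..., None] >> np.arange(LATENT_DIM)) & 1
        return bits.astype(np.float32) * 2.0 - 1.0

    # Toy usage: a 4-frame latent grid and a 1-frame image share the same id space.
    video_tokens = quantize(np.random.randn(4, 8, 8, LATENT_DIM))  # shape (4, 8, 8)
    image_tokens = quantize(np.random.randn(1, 8, 8, LATENT_DIM))  # same vocabulary

An autoregressive language model would then be trained to predict these integer ids with a standard next-token objective, which is what lets one tokenizer serve both image and video generation.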

If this is right

  • Language models become the stronger base architecture for both image and video generation once tokenization is addressed.
  • A single tokenizer supports high-quality generation across static images and dynamic video sequences.
  • Video compression reaches human-judged parity with next-generation codecs.
  • Token sequences from the tokenizer produce effective representations for downstream tasks such as action recognition.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future scaling of language models may widen the advantage over diffusion models without requiring architecture-specific changes.
  • The same tokenizer could be tested in hybrid pipelines that combine language-model generation with other visual modules.
  • If token quality dominates, similar gains might appear when the tokenizer is applied to language-model variants trained on different objectives.

Load-bearing premise

Performance differences arise mainly from the tokenizer rather than from differences in model scale, training data, optimization, or evaluation protocols between the language-model and diffusion baselines.

What would settle it

A matched experiment in which diffusion models and language models are trained at identical scale and on identical data with the new tokenizer, and which then shows no quality gap or a reversal in favor of diffusion models, would falsify the claim.
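
To make the falsification test concrete, here is a minimal sketch of that matched experiment as a cross of backbones and tokenizers under one fixed budget; the names, the budget, and run_cell are hypothetical placeholders for a real training and FID-evaluation pipeline, not anything the paper provides.

    from itertools import product

    # Hypothetical labels; none of these name a real API or checkpoint.
    BACKBONES = ["llm_autoregressive", "diffusion"]
    TOKENIZERS = ["magvit_v2", "magvit_v1", "vqgan"]
    BUDGET = {"params": 3e8, "train_steps": 300_000, "dataset": "imagenet_256"}  # held fixed per cell

    def run_cell(backbone: str, tokenizer: str, budget: dict) -> float:
        """Train one (backbone, tokenizer) pair under the shared budget and return its FID."""
        raise NotImplementedError("placeholder for the actual training / evaluation pipeline")

    def tokenizer_is_key() -> bool:
        fid = {(b, t): run_cell(b, t, BUDGET) for b, t in product(BACKBONES, TOKENIZERS)}
        # The title claim survives only if (a) the LLM beats diffusion when both use
        # MAGVIT-v2 and (b) the LLM's edge vanishes once an older tokenizer is swapped
        # into the same backbone (lower FID is better).
        llm_beats_diffusion = fid[("llm_autoregressive", "magvit_v2")] < fid[("diffusion", "magvit_v2")]
        gain_from_tokenizer = fid[("llm_autoregressive", "magvit_v1")] > fid[("llm_autoregressive", "magvit_v2")]
        return llm_beats_diffusion and gain_from_tokenizer

Either a missing quality gap or a reversal in any of these cells would count against the claim that the tokenizer, rather than the model family, is the decisive factor.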

read the original abstract

While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation. To effectively use LLMs for visual generation, one crucial component is the visual tokenizer that maps pixel-space inputs to discrete tokens appropriate for LLM learning. In this paper, we introduce MAGVIT-v2, a video tokenizer designed to generate concise and expressive tokens for both videos and images using a common token vocabulary. Equipped with this new tokenizer, we show that LLMs outperform diffusion models on standard image and video generation benchmarks including ImageNet and Kinetics. In addition, we demonstrate that our tokenizer surpasses the previously top-performing video tokenizer on two more tasks: (1) video compression comparable to the next-generation video codec (VVC) according to human evaluations, and (2) learning effective representations for action recognition tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces MAGVIT-v2, a visual tokenizer that maps images and videos to concise discrete tokens from a shared vocabulary. Equipped with this tokenizer, the authors claim that large language models outperform diffusion models on image generation (ImageNet) and video generation (Kinetics) benchmarks. The tokenizer is additionally shown to surpass prior video tokenizers on video compression (human-evaluated as comparable to the VVC codec) and on learning representations for action recognition.

Significance. If the central empirical claim holds after isolating the tokenizer contribution, the result would be significant: it would demonstrate that discrete tokenization improvements can allow autoregressive LLMs to surpass diffusion models on standard visual generation benchmarks, supporting a unified LLM-based approach to multimodal generation. The tokenizer's reported utility for compression and representation learning further adds practical value.

major comments (1)
  1. [Sections 4 and 5] The benchmark comparisons of the LLM (with MAGVIT-v2) against diffusion baselines (e.g., ADM, DiT variants) lack tokenizer-swap ablations that train the identical LLM backbone with prior tokenizers such as MAGVIT-v1 or VQGAN under matched scale, data, and optimization conditions. Without these controls, performance gains cannot be confidently attributed to the tokenizer rather than differences in model capacity or training regime, which directly undermines the title claim that the tokenizer is key.
minor comments (2)
  1. [Abstract] The abstract states the central outperformance result without any quantitative metrics, specific baseline names, model sizes, or training details; these should be added for immediate evaluability.
  2. [Section on compression evaluation] The human evaluation protocol for video compression (claimed comparable to VVC) requires additional details on rater instructions, number of samples, and statistical significance to support reproducibility.
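
To illustrate the kind of significance reporting the second minor comment asks for, here is a minimal sketch of a two-sided binomial test on pairwise human preferences between tokenizer reconstructions and the codec at a matched bitrate; the vote counts are invented for the example and do not come from the paper.

    from scipy.stats import binomtest

    # Invented tallies from a hypothetical pairwise A/B study (one vote per clip pair).
    prefer_tokenizer = 212   # raters preferring the tokenizer's reconstruction
    prefer_codec = 188       # raters preferring the codec at the same bitrate
    n = prefer_tokenizer + prefer_codec

    result = binomtest(prefer_tokenizer, n, p=0.5)  # null hypothesis: no preference either way
    print(f"preference rate = {prefer_tokenizer / n:.3f}")
    print(f"two-sided p-value vs. 50/50 = {result.pvalue:.3f}")
    # "Comparable to the codec" would correspond to a p-value above the chosen
    # threshold (e.g. 0.05): no detectable preference in either direction.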

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the opportunity to clarify the contributions of our work. We address the major comment point-by-point below.

read point-by-point responses
  1. Referee: [Sections 4 and 5] The benchmark comparisons of the LLM (with MAGVIT-v2) against diffusion baselines (e.g., ADM, DiT variants) lack tokenizer-swap ablations that train the identical LLM backbone with prior tokenizers such as MAGVIT-v1 or VQGAN under matched scale, data, and optimization conditions. Without these controls, performance gains cannot be confidently attributed to the tokenizer rather than differences in model capacity or training regime, which directly undermines the title claim that the tokenizer is key.

    Authors: We agree that matched tokenizer-swap ablations on the identical LLM backbone would provide the strongest isolation of the tokenizer's contribution. In the manuscript, our primary comparisons are against published diffusion models (ADM, DiT) that use their own training regimes and often implicit or different tokenization strategies, making exact controls challenging. We do demonstrate MAGVIT-v2's superiority over MAGVIT-v1 and VQGAN on video compression and action recognition using matched backbones and data, which supports the tokenizer as a key enabler. Full-scale LLM retraining with prior tokenizers under identical conditions was not performed due to prohibitive compute costs, but we have added explicit discussion of this limitation and the supporting evidence from auxiliary tasks in the revision.

    revision: partial

Circularity Check

0 steps flagged

No circularity: empirical benchmark comparison with no self-referential derivation or fitted predictions.

full rationale

The paper introduces MAGVIT-v2 as a new video/image tokenizer and reports empirical results showing LLMs equipped with it outperform published diffusion baselines on ImageNet and Kinetics. No equations, first-principles derivations, or predictions are presented that reduce by construction to the paper's own inputs, fitted parameters, or self-citations. The central claim rests on external benchmark numbers rather than any internal definitional loop, self-citation chain, or renaming of known results. This is a standard empirical architecture paper whose validity hinges on experimental controls, not on tautological reasoning.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim depends on the effectiveness of the newly introduced MAGVIT-v2 tokenizer as the decisive factor; the abstract introduces this component without independent evidence or prior validation.

invented entities (1)
  • MAGVIT-v2 tokenizer · no independent evidence
    purpose: Maps pixel inputs to concise discrete tokens suitable for LLM training on images and video
    New component introduced to enable the claimed LLM superiority; no external validation supplied.

pith-pipeline@v0.9.0 · 5519 in / 1095 out tokens · 30510 ms · 2026-05-13T20:01:59.559261+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

    cs.LG 2026-05 unverdicted novelty 7.0

    PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...

  2. Prompt-Guided Image Editing with Masked Logit Nudging in Visual Autoregressive Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Masked Logit Nudging aligns visual autoregressive model logits with source token maps under target prompts inside cross-attention masks, delivering top image editing results on PIE benchmarks and strong reconstruction...

  3. Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    cs.CV 2024-06 conditional novelty 7.0

    Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.

  4. InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation

    cs.CV 2026-05 conditional novelty 6.0

    InsightTok improves text and face fidelity in discrete image tokenization via content-aware perceptual losses, with gains transferring to autoregressive generation.

  5. Yeti: A compact protein structure tokenizer for reconstruction and multi-modal generation

    q-bio.BM 2026-05 unverdicted novelty 6.0

    Yeti is a compact tokenizer for protein structures that delivers strong codebook use, token diversity, and reconstruction while enabling from-scratch multimodal generation of plausible sequences and structures with 10...

  6. CASCADE: Context-Aware Relaxation for Speculative Image Decoding

    cs.CV 2026-05 unverdicted novelty 6.0

    CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to...

  7. MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality

    cs.CV 2026-05 unverdicted novelty 6.0

    MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.

  8. Video Active Perception: Effective Inference-Time Long-Form Video Understanding with Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    VAP is a training-free active-perception method that improves zero-shot long-form video QA performance and frame efficiency up to 5.6x in VLMs by selecting keyframes that differ from priors generated by a text-conditi...

  9. End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer

    cs.CV 2026-05 unverdicted novelty 6.0

    An end-to-end autoregressive model with a jointly trained 1D semantic tokenizer achieves state-of-the-art FID 1.48 on ImageNet 256x256 generation without guidance.

  10. VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations

    cs.CV 2026-04 unverdicted novelty 6.0

    VibeToken enables autoregressive image generation at arbitrary resolutions using 64 tokens for 1024x1024 images with 3.94 gFID, constant 179G FLOPs, and better efficiency than diffusion or fixed AR baselines.

  11. dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model

    cs.RO 2026-04 unverdicted novelty 6.0

    A discrete diffusion model tokenizes multimodal robotic data and uses a progress token to predict future states and task completion for scalable policy evaluation.

  12. Latent-Compressed Variational Autoencoder for Video Diffusion Models

    cs.CV 2026-04 unverdicted novelty 6.0

    A frequency-based latent compression method for video VAEs yields higher reconstruction quality than channel-reduction baselines at fixed compression ratios.

  13. ELT: Elastic Looped Transformers for Visual Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Elastic Looped Transformers share weights across recurrent blocks and apply intra-loop self-distillation to deliver 4x parameter reduction while matching competitive FID and FVD scores on ImageNet and UCF-101.

  14. ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving

    cs.CV 2026-04 unverdicted novelty 6.0

    ExploreVLA augments VLA driving models with future RGB and depth prediction for dense supervision and uses prediction uncertainty as a safety-gated intrinsic reward for RL-based exploration, reaching SOTA PDMS 93.7 on NAVSIM.

  15. CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    cs.RO 2024-11 unverdicted novelty 6.0

    CogACT is a new VLA model that uses a conditioned diffusion action transformer to achieve over 35% higher average success rates than OpenVLA in simulation and 55% in real-robot experiments while generalizing to new ro...

  16. CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    cs.CV 2024-08 unverdicted novelty 6.0

    CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.

  17. Co-Generative De Novo Functional Protein Design

    q-bio.QM 2026-05 unverdicted novelty 5.0

    CodeFP jointly generates protein sequences and structures using functional local structures and auxiliary supervision, yielding 6.1% better functional consistency and 3.2% better foldability than prior baselines.

  18. Open-Sora: Democratizing Efficient Video Production for All

    cs.CV 2024-12 unverdicted novelty 5.0

    Open-Sora releases an open-source video generation model based on a Spatial-Temporal Diffusion Transformer that decouples spatial and temporal attention, supporting text-to-video, image-to-video, and text-to-image tas...

  19. Movie Gen: A Cast of Media Foundation Models

    cs.CV 2024-10 unverdicted novelty 5.0

    A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.

  20. Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    cs.CV 2024-08 unverdicted novelty 5.0

    Show-o unifies autoregressive and discrete diffusion modeling inside one transformer to support multimodal understanding and generation tasks with competitive benchmark performance.

  21. Seedance 1.0: Exploring the Boundaries of Video Generation Models

    cs.CV 2025-06 unverdicted novelty 4.0

    Seedance 1.0 generates 5-second 1080p videos in about 41 seconds with claimed superior motion quality, prompt adherence, and multi-shot consistency compared to prior models.

  22. Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

    cs.CV 2024-02 unverdicted novelty 2.0

    The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.

Reference graph

Works this paper leans on

291 extracted references · 291 canonical work pages · cited by 22 Pith papers · 23 internal anchors
