pith. machine review for the scientific record.

arxiv: 2206.10789 · v1 · submitted 2022-06-22 · 💻 cs.CV · cs.LG

Recognition: 2 theorem links

· Lean Theorem

Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

Alexander Ku, Ben Hutchinson, Burcu Karagol Ayan, Gunjan Baid, Han Zhang, Jason Baldridge, Jiahui Yu, Jing Yu Koh, Thang Luong, Vijay Vasudevan, Wei Han, Xin Li, Yinfei Yang, Yonghui Wu, Yuanzhong Xu, Zarana Parekh, Zirui Wang

Pith reviewed 2026-05-12 04:45 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords text-to-image generation · autoregressive modeling · transformer scaling · ViT-VQGAN · FID evaluation · MS-COCO benchmark · complex scene synthesis · sequence-to-sequence vision

The pith

Scaling an autoregressive encoder-decoder Transformer to 20 billion parameters produces high-fidelity images from text prompts that include complex compositions and world knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that text-to-image generation can be framed as a sequence-to-sequence problem where text tokens map to image tokens. A ViT-VQGAN first converts images into discrete token sequences, after which an encoder-decoder Transformer is trained autoregressively to predict those image tokens from text. Scaling this model from smaller sizes up to 20 billion parameters yields steady gains in photorealism and content richness, measured by new state-of-the-art zero-shot and finetuned FID scores on MS-COCO. The same scaling also improves results across varied prompt categories in a new benchmark of over 1600 English prompts and in Localized Narratives.

Core claim

By treating image synthesis as autoregressive token prediction in the same style as machine translation, and by scaling the underlying encoder-decoder Transformer to 20 billion parameters, the model reaches zero-shot FID of 7.23 and finetuned FID of 3.22 on MS-COCO while generating images that respect complex spatial arrangements and external knowledge.
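
For readers unfamiliar with the metric, FID fits a Gaussian to Inception features of real images and another to features of generated images, then computes FID = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2)); lower is better. The sketch below is a simplified illustration only, assuming diagonal covariances so that the trace term reduces to a sum of squared differences of standard deviations. It is not the evaluation code behind the reported 7.23 and 3.22, and the function name and toy feature vectors are hypothetical.

```python
import math
from statistics import fmean, pvariance


def diagonal_fid(real_feats, gen_feats):
    """Frechet distance between two Gaussians fitted per feature dimension.

    Assumption: covariances are diagonal, so
    Tr(S_r + S_g - 2*sqrt(S_r*S_g)) = sum((sqrt(var_r) - sqrt(var_g))**2).
    The real metric uses full covariance matrices of Inception features.
    """
    dims = len(real_feats[0])
    total = 0.0
    for d in range(dims):
        r = [row[d] for row in real_feats]
        g = [row[d] for row in gen_feats]
        mean_term = (fmean(r) - fmean(g)) ** 2
        cov_term = (math.sqrt(pvariance(r)) - math.sqrt(pvariance(g))) ** 2
        total += mean_term + cov_term
    return total


# Toy 2-D "features" (placeholders, not Inception activations).
real = [[0.0, 1.0], [2.0, 3.0]]
same = [[0.0, 1.0], [2.0, 3.0]]
shifted = [[1.0, 1.0], [3.0, 3.0]]  # first dimension shifted by +1

identical_score = diagonal_fid(real, same)
shifted_score = diagonal_fid(real, shifted)
```

Identical feature sets score zero, and shifting one feature dimension raises the score, which is why a drop in FID is read as generated images moving closer to the real distribution.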

What carries the argument

The encoder-decoder Transformer that receives text token sequences and autoregressively outputs image token sequences, scaled to 20 billion parameters, after images have been tokenized by ViT-VQGAN.
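
The decoding loop this describes can be sketched as follows. This is a toy illustration under stated assumptions: `toy_next_token_logits` is a hypothetical stand-in for the encoder-decoder Transformer, the vocabulary size and sequence length are made up, and greedy selection is used purely for simplicity rather than as a claim about how the paper samples.

```python
from typing import Callable, List, Sequence

IMAGE_VOCAB_SIZE = 8   # hypothetical; a real image codebook is far larger
NUM_IMAGE_TOKENS = 6   # hypothetical; a real image needs many more tokens


def toy_next_token_logits(text_tokens: Sequence[int],
                          image_prefix: Sequence[int]) -> List[float]:
    """Hypothetical stand-in for the Transformer: deterministic scores that
    depend on the text tokens and on the image tokens emitted so far."""
    seed = sum(text_tokens) + 3 * len(image_prefix) + sum(image_prefix)
    return [float((seed * (k + 1)) % IMAGE_VOCAB_SIZE)
            for k in range(IMAGE_VOCAB_SIZE)]


def generate_image_tokens(text_tokens: Sequence[int],
                          logits_fn: Callable[[Sequence[int], Sequence[int]], List[float]],
                          length: int = NUM_IMAGE_TOKENS) -> List[int]:
    """Autoregressive decoding: each image token is chosen conditioned on the
    text and on every image token generated before it."""
    image_tokens: List[int] = []
    for _ in range(length):
        logits = logits_fn(text_tokens, image_tokens)
        # Greedy choice: index of the highest-scoring token.
        next_token = max(range(len(logits)), key=logits.__getitem__)
        image_tokens.append(next_token)
    return image_tokens


tokens = generate_image_tokens([5, 2, 7], toy_next_token_logits)
```

In the full system the resulting token sequence would be passed to the tokenizer's decoder to produce pixels; that step is omitted here.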

If this is right

  • Image quality and compositional accuracy rise consistently as the Transformer grows larger.
  • The model demonstrates stronger handling of world knowledge and detailed scene descriptions than prior text-to-image systems.
  • Performance can be measured holistically across prompt difficulty using the new PartiPrompts benchmark.
  • Limitations in current outputs define concrete targets for the next round of improvements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same scaling laws observed in language modeling appear to transfer when images are represented as token sequences.
  • A single autoregressive architecture could eventually support joint text-image generation and understanding tasks.
  • Future work could test whether the same tokenizer-plus-Transformer recipe extends to video or 3D content without major redesign.

Load-bearing premise

That the discrete tokens from the image tokenizer retain enough visual information and that larger model sizes will continue to reduce error rates on intricate scenes without new failure modes.

What would settle it

If further increases in model size beyond 20 billion parameters produce no additional drop in FID on MS-COCO or fail to improve accuracy on prompts that require precise multi-object spatial relationships, the claim that scaling drives the gains would be refuted.

Original abstract

We present the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language. This strategy can naturally tap into the rich body of prior work on large language models, which have seen continued advances in capabilities and performance through scaling data and model sizes. Our approach is simple: First, Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens. Second, we achieve consistent quality improvements by scaling the encoder-decoder Transformer model up to 20B parameters, with a new state-of-the-art zero-shot FID score of 7.23 and finetuned FID score of 3.22 on MS-COCO. Our detailed analysis on Localized Narratives as well as PartiPrompts (P2), a new holistic benchmark of over 1600 English prompts, demonstrate the effectiveness of Parti across a wide variety of categories and difficulty aspects. We also explore and highlight limitations of our models in order to define and exemplify key areas of focus for further improvements. See https://parti.research.google/ for high-resolution images.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript presents the Parti model for text-to-image generation, framing the task as sequence-to-sequence modeling. Images are first tokenized into discrete sequences via a fixed ViT-VQGAN encoder; an encoder-decoder Transformer is then scaled up to 20B parameters to autoregressively predict the image-token sequence conditioned on text. The central empirical claims are consistent quality gains from scaling, a new zero-shot SOTA FID of 7.23 and finetuned FID of 3.22 on MS-COCO, plus supporting evaluations on Localized Narratives and the new PartiPrompts benchmark of >1600 prompts.

Significance. If the reported scaling behavior holds, the work demonstrates that autoregressive modeling—directly importing techniques and scaling laws from large language models—can produce high-fidelity, content-rich images without diffusion-style iterative refinement. The concrete FID numbers on standard benchmarks, the introduction of PartiPrompts, and the explicit discussion of limitations constitute clear, falsifiable contributions that can guide subsequent research. The empirical nature of the results (direct measurements on held-out data) avoids circularity.

minor comments (3)
  1. The abstract states that scaling yields 'consistent quality improvements,' yet no table or figure in the provided summary shows the per-scale FID or human-preference curves that would make this claim directly verifiable; such a plot should be added or referenced by section number.
  2. The ViT-VQGAN tokenizer is described as fixed; a brief quantitative statement of its reconstruction FID or perceptual loss on the training distribution would help readers assess whether observed generation gains are tokenizer-bounded.
  3. The new PartiPrompts benchmark is introduced as 'holistic,' but the manuscript should explicitly list the difficulty axes (e.g., counting, spatial relations, world knowledge) and the number of prompts per axis so that future work can replicate the evaluation protocol.
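
A per-axis breakdown of the kind requested in comment 3 could be produced with a simple tally. The sketch below is illustrative only: the records and axis labels are made-up placeholders, not entries from PartiPrompts, and the record layout is an assumption.

```python
from collections import Counter

# Hypothetical records: (prompt text, difficulty axis). Not real benchmark data.
prompts = [
    ("three red apples on a table", "counting"),
    ("a cat to the left of a dog", "spatial relations"),
    ("the Eiffel Tower at sunset", "world knowledge"),
    ("two birds above a lake", "counting"),
]


def prompts_per_axis(records):
    """Return {axis: number of prompts} for a list of (prompt, axis) pairs."""
    return dict(Counter(axis for _, axis in records))


counts = prompts_per_axis(prompts)
```

Publishing such a table alongside the prompts would let later work report results on the same slices.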

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The recognition of our scaling results, new SOTA FID scores, PartiPrompts benchmark, and explicit limitations discussion is appreciated. No specific major comments were listed in the report, so we have no point-by-point rebuttals to provide.

Circularity Check

0 steps flagged

No circularity: results are direct empirical measurements

full rationale

The paper reports empirical outcomes from training encoder-decoder Transformers of increasing size (up to 20B parameters) on sequences of discrete image tokens produced by a fixed ViT-VQGAN tokenizer. Performance is quantified via zero-shot and finetuned FID scores on held-out MS-COCO splits plus qualitative analysis on Localized Narratives and PartiPrompts. No derivation chain, equations, or self-citations are invoked to obtain these scores; the reported improvements are measured results on external benchmarks rather than quantities forced by construction from model parameters or prior self-citations. The tokenizer is treated as an independent preprocessing step whose reconstruction fidelity is not derived from the autoregressive model itself.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Central claim rests on empirical scaling benefits and the assumption that discrete token sequences from ViT-VQGAN suffice for photorealistic generation.

free parameters (1)
  • Transformer model size (up to 20B)
    Chosen scale at which quality improvements are observed; not derived from first principles.
axioms (1)
  • domain assumption: Images can be lossily but usefully represented as sequences of discrete tokens from a ViT-VQGAN tokenizer
    Invoked to justify the sequence-to-sequence framing.
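
The "lossy but useful" assumption can be illustrated with a toy vector quantizer: each patch vector is replaced by the index of its nearest codebook entry, and decoding returns the codebook entry rather than the original patch. This is a minimal sketch, not ViT-VQGAN itself; the codebook, the patch values, and the function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode(patches, codebook):
    """Map each patch vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(p, codebook[k]))
        for p in patches
    ]


def decode(tokens, codebook):
    """Replace each token by its codebook vector (the lossy reconstruction)."""
    return [codebook[t] for t in tokens]


# Hypothetical 2-D codebook and patches; real ones are learned and high-dimensional.
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
patches = [(0.1, 0.0), (0.9, 1.0), (0.0, 0.8)]

tokens = encode(patches, codebook)
reconstruction = decode(tokens, codebook)
error = sum(squared_distance(p, r) for p, r in zip(patches, reconstruction))
```

The nonzero reconstruction error is the information the tokenizer discards, and it bounds how faithful any image generated from those tokens can be, which is the concern behind the referee's request for the tokenizer's reconstruction quality.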

pith-pipeline@v0.9.0 · 5608 in / 1221 out tokens · 47223 ms · 2026-05-12T04:45:30.299538+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 39 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation

    cs.DC 2026-04 unverdicted novelty 8.0

    Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.

  2. Prompt-to-Prompt Image Editing with Cross Attention Control

    cs.CV 2022-08 unverdicted novelty 8.0

    Cross-attention control in text-conditioned models enables localized and global image edits by editing only the input text prompt.

  3. An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

    cs.CV 2022-08 unverdicted novelty 8.0

    Textual Inversion learns a single embedding vector from a few images to represent personal concepts inside the text embedding space of a frozen text-to-image model, enabling their composition in natural language prompts.

  4. Does Engram Do Memory Retrieval in Autoregressive Image Generation?

    cs.CV 2026-05 accept novelty 7.0

    Engram in AR image generation saves backbone FLOPs but trails pure AR baselines in FID and behaves as a gated side-pathway rather than a content-addressed retriever.

  5. STRIDE: Training-Free Diversity Guidance via PCA-Directed Feature Perturbation in Single-Step Diffusion Models

    cs.CV 2026-05 unverdicted novelty 7.0

    STRIDE boosts diversity in one-step diffusion models by injecting PCA-aligned pink noise into transformer features while preserving text alignment and quality.

  6. ExtraVAR: Stage-Aware RoPE Remapping for Resolution Extrapolation in Visual Autoregressive Models

    cs.CV 2026-05 unverdicted novelty 7.0

    ExtraVAR enables resolution extrapolation in visual autoregressive models by stage-aware RoPE remapping and entropy-driven attention scaling, suppressing repetition and detail loss.

  7. Arena as Offline Reward: Efficient Fine-Grained Preference Optimization for Diffusion Models

    cs.CV 2026-05 unverdicted novelty 7.0

    ArenaPO infers Gaussian capability distributions from pairwise preferences and applies truncated-normal latent inference to derive fine-grained offline rewards for preference optimization of text-to-image diffusion models.

  8. VoxAfford: Multi-Scale Voxel-Token Fusion for Open-Vocabulary 3D Affordance Detection

    cs.CV 2026-05 unverdicted novelty 7.0

    VoxAfford fuses multi-scale voxel features into MLLM output tokens using cross-attention with a learned compatibility gate to achieve SOTA open-vocabulary 3D affordance detection with ~8% mIoU gain and zero-shot robot...

  9. Knowledge Visualization: A Benchmark and Method for Knowledge-Intensive Text-to-Image Generation

    cs.CV 2026-04 conditional novelty 7.0

    KVBench reveals major gaps in current T2I models for knowledge-intensive tasks, and KE-Check narrows the gap between open- and closed-source models by adding structured knowledge and enforcing constraints.

  10. Unified Reward Model for Multimodal Understanding and Generation

    cs.CV 2025-03 unverdicted novelty 7.0

    UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.

  11. Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    cs.CV 2024-06 conditional novelty 7.0

    Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.

  12. ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

    cs.CV 2024-03 unverdicted novelty 7.0

    ELLA introduces a timestep-aware semantic connector to link LLMs with diffusion models for improved dense prompt following, validated on a new 1K-prompt benchmark.

  13. Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    cs.CV 2023-10 unverdicted novelty 7.0

    A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.

  14. Visual Instruction Tuning

    cs.CV 2023-04 unverdicted novelty 7.0

    LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.

  15. Scalable Diffusion Models with Transformers

    cs.CV 2022-12 unverdicted novelty 7.0

    DiTs achieve SOTA FID of 2.27 on ImageNet 256x256 by scaling transformer-based latent diffusion models, with performance improving consistently as Gflops increase.

  16. LAION-5B: An open large-scale dataset for training next generation image-text models

    cs.CV 2022-10 accept novelty 7.0

    LAION-5B is an openly released dataset of 5.85 billion CLIP-filtered image-text pairs that enables replication of foundational vision-language models.

  17. Imagen Video: High Definition Video Generation with Diffusion Models

    cs.CV 2022-10 unverdicted novelty 7.0

    Imagen Video generates high-definition text-conditional videos via a cascade of base and super-resolution diffusion models, achieving high fidelity and controllability.

  18. DreamFusion: Text-to-3D using 2D Diffusion

    cs.CV 2022-09 accept novelty 7.0

    Optimizes a Neural Radiance Field via probability density distillation from a 2D diffusion model to produce text-conditioned 3D scenes viewable from any angle.

  19. L2P: Unlocking Latent Potential for Pixel Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.

  20. FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    FlashAR achieves up to 22.9x speedup in 512x512 autoregressive image generation by post-training a pre-trained model with a complementary vertical head and dynamic fusion using only 0.05% of original training data.

  21. FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    FlashAR accelerates autoregressive image generation up to 22.9x by post-training a pre-trained raster-scan model with a complementary vertical head and dynamic fusion for two-way next-token prediction.

  22. CASCADE: Context-Aware Relaxation for Speculative Image Decoding

    cs.CV 2026-05 unverdicted novelty 6.0

    CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to...

  23. Threshold-Guided Optimization for Visual Generative Models

    cs.LG 2026-05 unverdicted novelty 6.0

    A threshold-guided alignment method lets visual generative models be optimized directly from scalar human ratings instead of requiring paired preference data.

  24. Visual Implicit Autoregressive Modeling

    cs.CV 2026-05 unverdicted novelty 6.0

    VIAR embeds implicit equilibrium layers in visual autoregressive models to achieve ImageNet FID 2.16 with 38.4% of VAR parameters and controllable inference compute.

  25. ViPO: Visual Preference Optimization at Scale

    cs.CV 2026-04 unverdicted novelty 6.0

    Poly-DPO improves robustness to noisy preference data in visual models, and the new ViPO dataset enables superior performance, with the method reducing to standard DPO on high-quality data.

  26. VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations

    cs.CV 2026-04 unverdicted novelty 6.0

    VibeToken enables autoregressive image generation at arbitrary resolutions using 64 tokens for 1024x1024 images with 3.94 gFID, constant 179G FLOPs, and better efficiency than diffusion or fixed AR baselines.

  27. Normalizing Flows with Iterative Denoising

    cs.CV 2026-04 unverdicted novelty 6.0

    iTARFlow augments normalizing flows with diffusion-style iterative denoising during sampling while preserving end-to-end likelihood training, reaching competitive results on ImageNet 64/128/256.

  28. Closed-Form Concept Erasure via Double Projections

    cs.LG 2026-04 unverdicted novelty 6.0

    A training-free double-projection linear transformation erases target concepts from generative models by computing a proxy projection then applying a constrained update in the left null space of known directions.

  29. FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

    cs.GR 2025-06 unverdicted novelty 6.0

    FLUX.1 Kontext unifies image generation and editing via flow matching and sequence concatenation, delivering improved multi-turn consistency and speed on the new KontextBench benchmark.

  30. Emu3: Next-Token Prediction is All You Need

    cs.CV 2024-09 unverdicted novelty 6.0

    Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.

  31. Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

    cs.AI 2024-08 unverdicted novelty 6.0

    A single transformer combines language modeling loss and diffusion loss on mixed-modality data, scaling to 7B parameters and 2T tokens while matching specialized language and diffusion models.

  32. Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

    cs.CV 2024-03 conditional novelty 6.0

    Biased noise sampling for rectified flows combined with a bidirectional text-image transformer architecture yields state-of-the-art high-resolution text-to-image results that scale predictably with model size.

  33. VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

    cs.CV 2023-10 unverdicted novelty 6.0

    Open-source text-to-video and image-to-video diffusion models generate high-quality 1024x576 videos, with the I2V variant claimed as the first to strictly preserve reference image content.

  34. IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

    cs.CV 2023-08 unverdicted novelty 6.0

    IP-Adapter adds effective image prompting to text-to-image diffusion models using a lightweight decoupled cross-attention adapter that works alongside text prompts and other controls.

  35. Make-A-Video: Text-to-Video Generation without Text-Video Data

    cs.CV 2022-09 unverdicted novelty 6.0

    Make-A-Video achieves state-of-the-art text-to-video generation by decomposing temporal U-Net and attention structures to add space-time modeling to text-to-image models, trained without any paired text-video data.

  36. Towards General Preference Alignment: Diffusion Models at Nash Equilibrium

    cs.LG 2026-05 unverdicted novelty 5.0

    Diff.-NPO frames diffusion alignment as a self-play game reaching Nash equilibrium and reports better text-to-image results than prior DPO-style methods.

  37. ACPO: Anchor-Constrained Perceptual Optimization for Diffusion Models with No-Reference Quality Guidance

    cs.CV 2026-04 unverdicted novelty 5.0

    ACPO uses anchor-based regularization with NR-IQA guidance to enable stable perceptual quality improvements in diffusion model fine-tuning.

  38. Galactica: A Large Language Model for Science

    cs.CL 2022-11 unverdicted novelty 5.0

    Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.

  39. Qwen-Image-2.0 Technical Report

    cs.CV 2026-05 unverdicted novelty 4.0

    Qwen-Image-2.0 unifies high-fidelity image generation and precise editing by coupling Qwen3-VL with a Multimodal Diffusion Transformer, improving text rendering, photorealism, and complex prompt following over prior versions.

Reference graph

Works this paper leans on

99 extracted references · 99 canonical work pages · cited by 38 Pith papers · 14 internal anchors

  1. [1]

    Introducing pathways: A next-generation ai architecture, 2021

    Jeff Dean. Introducing pathways: A next-generation ai architecture, 2021

  2. [2]

    Zero-shot text-to-image generation

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021

  3. [3]

    Cogview: Mastering text-to-image generation via transformers

    Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. Cogview: Mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems, 34, 2021

  4. [4]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  5. [5]

    Discrete variational autoencoders

    Jason Tyler Rolfe. Discrete variational autoencoders. arXiv preprint arXiv:1609.02200, 2016

  6. [6]

    Neural discrete representation learning

    Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017

  7. [7]

    Improving language understanding by generative pre-training

    Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018

  8. [8]

    Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

    Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, 2021

  9. [9]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machi...

  10. [10]

    Make-a-scene: Scene-based text-to-image generation with human priors

    Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene-based text-to-image generation with human priors. arXiv preprint arXiv:2203.13131, 2022

  11. [11]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021

  12. [12]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022

  13. [13]

    Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022

  14. [14]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020

  15. [15]

    Diffusion models beat gans on image synthesis

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34, 2021

  16. [16]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014

  17. [17]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott...

  18. [18]

    LaMDA: Language Models for Dialog Applications

    Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, YaGuang Li, Hongrae Lee, Huaixiu Steven Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Maxim Krikun, Dmitry Lepikhin, James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Vincent...

  19. [19]

    GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

    Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. Glam: Efficient scaling of language models with mixture-of-experts. arXiv preprint arXiv:2112.06905, 2021

  20. [21]

    Vector-quantized image modeling with improved VQGAN

    Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved vqgan. arXiv preprint arXiv:2110.04627, 2021

  21. [22]

    Transformer transducer: A streamable speech recognition model with transformer encoders and rnn-t loss

    Qian Zhang, Han Lu, Hasim Sak, Anshuman Tripathi, Erik McDermott, Stephen Koo, and Shankar Kumar. Transformer transducer: A streamable speech recognition model with transformer encoders and rnn-t loss. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7829–7833. IEEE, 2020

  22. [23]

    Conformer: Convolution-augmented transformer for speech recognition

    Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, et al. Conformer: Convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100, 2020

  23. [24]

    Towards a human-like open-domain chatbot

    Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, and Quoc V. Le. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977, 2020

  24. [25]

    Coca: Contrastive captioners are image-text foundation models

    Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022

  25. [26]

    Taming transformers for high-resolution image synthesis

    Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. In CVPR, 2021

  26. [27]

    Megatron-lm: Training multi-billion parameter language models using model parallelism, 2019

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism, 2019

  27. [28]

    Gspmd: general and scalable parallelization for ml computation graphs

    Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Blake Hechtman, Yanping Huang, Rahul Joshi, Maxim Krikun, Dmitry Lepikhin, Andy Ly, Marcello Maggioni, et al. Gspmd: general and scalable parallelization for ml computation graphs. arXiv preprint arXiv:2105.04663, 2021

  28. [29]

    Connecting vision and language with localized narratives

    Jordi Pont-Tuset, Jasper Uijlings, Soravit Changpinyo, Radu Soricut, and Vittorio Ferrari. Connecting vision and language with localized narratives. In ECCV, 2020

  29. [30]

    Generative pretraining from pixels

    Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In International Conference on Machine Learning, pages 1691–1703. PMLR, 2020

  30. [31]

    Wide activation for efficient and accurate image super-resolution

    Jiahui Yu, Yuchen Fan, Jianchao Yang, Ning Xu, Zhaowen Wang, Xinchao Wang, and Thomas Huang. Wide activation for efficient and accurate image super-resolution. arXiv preprint arXiv:1808.08718, 2018

  31. [32]

    Neural Machine Translation of Rare Words with Subword Units

    Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015

  32. [33]

    SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing

    Taku Kudo and John Richardson. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226, 2018

  33. [34]

    Generating Long Sequences with Sparse Transformers

    Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019

  34. [35]

    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019

  35. [36]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

  36. [37]

    Classifier-free diffusion guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021

  37. [38]

    Classifier free guidance for autoregressive transformers

    Katherine Crowson. Classifier free guidance for autoregressive transformers. 2021

  38. [39]

    Lingvo: a modular and scalable framework for sequence-to-sequence modeling

    Jonathan Shen, Patrick Nguyen, Yonghui Wu, Zhifeng Chen, Mia X Chen, Ye Jia, Anjuli Kannan, Tara Sainath, Yuan Cao, Chung-Cheng Chiu, et al. Lingvo: a modular and scalable framework for sequence-to-sequence modeling. arXiv preprint arXiv:1902.08295, 2019

  39. [40]

    Gpipe: Efficient training of giant neural networks using pipeline parallelism

    Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, and Zhifeng Chen. Gpipe: Efficient training of giant neural networks using pipeline parallelism, 2018

  40. [41]

    Efficient large-scale language model training on gpu clusters using megatron-lm

    Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Anand Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. Efficient large-scale language model training on gpu clusters using megatron-lm, 2021

  41. [42]

    Adafactor: Adaptive learning rates with sublinear memory cost

    Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning, pages 4596–4604. PMLR, 2018

  42. [43]

    LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

    Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021

  43. [44]

    Scaling vision transformers

    Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers, 2021

  44. [45]

    Simvlm: Simple visual language model pretraining with weak supervision

    Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. Simvlm: Simple visual language model pretraining with weak supervision. arXiv preprint arXiv:2108.10904, 2021

  45. [46]

    Text-to-image generation grounded by fine-grained user attention

    Jing Yu Koh, Jason Baldridge, Honglak Lee, and Yinfei Yang. Text-to-image generation grounded by fine-grained user attention. WACV, 2021

  46. [47]

    Cross-modal contrastive learning for text-to-image generation

    Han Zhang, Jing Yu Koh, Jason Baldridge, Honglak Lee, and Yinfei Yang. Cross-modal contrastive learning for text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 833–842, 2021

  47. [48]

    Benchmark for compositional text-to-image synthesis

    Dong Huk Park, Samaneh Azadi, Xihui Liu, Trevor Darrell, and Anna Rohrbach. Benchmark for compositional text-to-image synthesis. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021

  48. [49]

    Vector quantized diffusion model for text-to-image synthesis

    Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. arXiv preprint arXiv:2111.14822, 2021

  49. [50]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021

  50. [51]

    Accelerating large-scale inference with anisotropic vector quantization

    Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng, David Simcha, Felix Chern, and Sanjiv Kumar. Accelerating large-scale inference with anisotropic vector quantization. In International Conference on Machine Learning, 2020

  51. [52]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017

  52. [53]

    Rethinking the inception architecture for computer vision

    Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016

  53. [54]

    AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks

    Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In CVPR, 2018

  54. [55]

    Dall-eval: Probing the reasoning skills and social biases of text-to-image generative transformers

    Jaemin Cho, Abhay Zala, and Mohit Bansal. Dall-eval: Probing the reasoning skills and social biases of text-to-image generative transformers. 2022

  55. [56]

    Unifying vision-and-language tasks via text generation

    Jaemin Cho, Jie Lei, Hao Tan, and Mohit Bansal. Unifying vision-and-language tasks via text generation. In ICML, 2021

  56. [57]

    Bleu: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In ACL, 2002

  57. [58]

    Cider: Consensus-based image description evaluation

    Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In CVPR, 2015

  58. [59]

    Meteor universal: Language specific translation evaluation for any target language

    Michael Denkowski and Alon Lavie. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 376–380, Baltimore, Maryland, USA, June 2014. Association for Computational Linguistics

  59. [60]

    SPICE: Semantic Propositional Image Caption Evaluation

    Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. SPICE: Semantic Propositional Image Caption Evaluation. In ECCV, 2016

  60. [61]

    Cogview2: Faster and better text-to-image generation via hierarchical transformers

    Ming Ding, Wendi Zheng, Wenyi Hong, and Jie Tang. Cogview2: Faster and better text-to-image generation via hierarchical transformers, 2022

  61. [62]

    mindall-e on conceptual captions

    Saehoon Kim, Sanghun Cho, Chiheon Kim, Doyup Lee, and Woonhyuk Baek. mindall-e on conceptual captions. https://github.com/kakaobrain/minDALL-E, 2021

  62. [63]

    X-lxmert: Paint, caption and answer questions with multi-modal transformers

    Jaemin Cho, Jiasen Lu, Dustin Schwenk, Hannaneh Hajishirzi, and Aniruddha Kembhavi. X-lxmert: Paint, caption and answer questions with multi-modal transformers. ArXiv, abs/2009.11278, 2020

  63. [64]

    Deep visual-semantic alignments for generating image descriptions

    Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39:664–676, 2017

  64. [65]

    How to marry a star: Probabilistic constraints for meaning in context

    Katrin Erk and Aurélie Herbelot. How to marry a star: Probabilistic constraints for meaning in context. In Proceedings of the Society for Computation in Linguistics 2021, pages 451–453, Online, February 2021. Association for Computational Linguistics

  65. [66]

    Wordseye: an automatic text-to-scene conversion system

    Bob Coyne and Richard Sproat. Wordseye: an automatic text-to-scene conversion system. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, 2001

  66. [67]

    Generative adversarial text to image synthesis

    Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In International conference on machine learning, pages 1060–1069. PMLR, 2016

  67. [68]

    StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks

    Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017

  68. [69]

    StackGAN++: Realistic image synthesis with stacked generative adversarial networks

    Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N. Metaxas. StackGAN++: Realistic image synthesis with stacked generative adversarial networks. TPAMI, 2018

  69. [70]

    Learning what and where to draw

    Scott E Reed, Zeynep Akata, Santosh Mohan, Samuel Tenka, Bernt Schiele, and Honglak Lee. Learning what and where to draw. NeurIPS, 29, 2016

  70. [71]

    Inferring semantic layout for hierarchical text-to-image synthesis

    Seunghoon Hong, Dingdong Yang, Jongwook Choi, and Honglak Lee. Inferring semantic layout for hierarchical text-to-image synthesis. In CVPR, 2018

  71. [72]

    Generating multiple objects at spatially distinct locations

    Tobias Hinz, Stefan Heinrich, and Stefan Wermter. Generating multiple objects at spatially distinct locations. In ICLR, 2019

  72. [73]

    Dall·e mini

    Boris Dayma, Suraj Patil, Pedro Cuenca, Khalid Saifullah, Tanishq Abraham, Phuc Le Khac, Luke Melas, and Ritobrata Ghosh. Dall·e mini, July 2021

  73. [74]

    MaskGIT: Masked Generative Image Transformer

    Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. arXiv preprint arXiv:2202.04200, 2022

  74. [75]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022

  75. [76]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022

  76. [77]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021

  77. [78]

    Cascaded diffusion models for high fidelity image generation

    Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. Journal of Machine Learning Research, 23(47):1–33, 2022

  78. [79]

    Analyzing and improving the image quality of stylegan

    Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8110–8119, 2020

  79. [80]

    Perceptual losses for real-time style transfer and super-resolution

    Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, pages 694–711. Springer, 2016

  80. [81]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018

Showing first 80 references.