Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Pith reviewed 2026-05-13 05:54 UTC · model grok-4.3
The pith
Transfusion trains one transformer on mixed text and image sequences by combining next-token prediction with diffusion losses.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Transfusion combines the language modeling loss function with diffusion to train a single transformer over mixed-modality sequences, establishing scaling laws and reaching performance on par with separately trained language models and diffusion models when scaled to 7B parameters and 2T tokens.
What carries the argument
Joint optimization of next-token language modeling loss on text and diffusion loss on image patches inside one transformer, optionally augmented by modality-specific encoding and decoding layers.
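In symbols, a minimal reading of that recipe (the balancing coefficient λ and the noise-prediction parameterization are illustrative assumptions; the authors' concrete choices are discussed in the rebuttal below):

    \mathcal{L}_{\text{Transfusion}} = \mathcal{L}_{\text{LM}} + \lambda \, \mathcal{L}_{\text{DDPM}}, \qquad
    \mathcal{L}_{\text{LM}} = -\textstyle\sum_i \log p_\theta(y_i \mid y_{<i}), \qquad
    \mathcal{L}_{\text{DDPM}} = \mathbb{E}_{t,\epsilon}\left[ \lVert \epsilon - \epsilon_\theta(x_t, t) \rVert_2^2 \right]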
If this is right
- The combined loss scales better than language modeling over quantized image tokens across uni-modal and cross-modal tasks.
- Modality-specific encoding and decoding layers improve performance and permit extreme image compression to 16 patches.
- At 7B parameters the single model matches the generation quality of specialized diffusion models for images and language models for text.
- Training on 2T mixed multi-modal tokens produces competitive results without separate modality-specific architectures.
Where Pith is reading between the lines
- The same joint-loss recipe could be tested on additional continuous modalities such as audio by swapping the diffusion component.
- Unified training may reduce the engineering overhead of maintaining separate text and image model families for downstream applications.
- The observed lack of interference at current scales invites direct measurement of gradient alignment between the two loss terms during training.
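On the last point, a hedged sketch of what such a measurement could look like; model, lm_loss, and diff_loss are hypothetical stand-ins, not artifacts from the paper:

    import torch

    def grad_cosine(model: torch.nn.Module,
                    lm_loss: torch.Tensor,
                    diff_loss: torch.Tensor) -> float:
        """Cosine similarity between the two loss gradients over shared weights.

        Values near -1 would indicate the negative interference the premise
        rules out; values near 0 or above suggest the objectives coexist.
        """
        params = [p for p in model.parameters() if p.requires_grad]
        g_lm = torch.autograd.grad(lm_loss, params, retain_graph=True,
                                   allow_unused=True)
        g_df = torch.autograd.grad(diff_loss, params, retain_graph=True,
                                   allow_unused=True)
        dot, n_lm, n_df = 0.0, 0.0, 0.0
        for a, b in zip(g_lm, g_df):
            if a is None or b is None:  # parameter touched by only one objective
                continue
            dot += (a * b).sum().item()
            n_lm += a.pow(2).sum().item()
            n_df += b.pow(2).sum().item()
        return dot / ((n_lm ** 0.5) * (n_df ** 0.5) + 1e-12)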
Load-bearing premise
That the language modeling and diffusion objectives can be optimized together in the same transformer without substantial conflicts or negative interference between the two modalities.
What would settle it
A controlled scaling run at 7B parameters or larger in which the Transfusion model falls noticeably behind matched-scale separate language and diffusion models on standard text and image generation benchmarks.
read the original abstract
We introduce Transfusion, a recipe for training a multi-modal model over discrete and continuous data. Transfusion combines the language modeling loss function (next token prediction) with diffusion to train a single transformer over mixed-modality sequences. We pretrain multiple Transfusion models up to 7B parameters from scratch on a mixture of text and image data, establishing scaling laws with respect to a variety of uni- and cross-modal benchmarks. Our experiments show that Transfusion scales significantly better than quantizing images and training a language model over discrete image tokens. By introducing modality-specific encoding and decoding layers, we can further improve the performance of Transfusion models, and even compress each image to just 16 patches. We further demonstrate that scaling our Transfusion recipe to 7B parameters and 2T multi-modal tokens produces a model that can generate images and text on a par with similar scale diffusion models and language models, reaping the benefits of both worlds.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Transfusion, a training recipe for a single transformer model that processes mixed sequences of discrete text tokens and continuous image data. It combines the standard language modeling loss (next-token prediction) with a diffusion objective for images. The authors pretrain models scaling up to 7B parameters on 2 trillion multi-modal tokens, establish empirical scaling laws across uni- and cross-modal tasks, and show that Transfusion outperforms approaches that quantize images into discrete tokens. They further introduce modality-specific encoding and decoding layers that allow compressing each image to only 16 patches while improving performance, and demonstrate that the 7B model generates text and images competitively with specialized models of similar scale.
Significance. If the reported results hold under full experimental disclosure, this work provides empirical support for a unified multi-modal architecture that jointly trains on discrete and continuous data without separate modality backbones. The scaling observations up to 7B/2T tokens and the competitive generation quality against specialized diffusion and language models would be a useful data point for the community, particularly the reported ability to compress images to 16 patches via modality-specific layers.
major comments (3)
- [Abstract and §3 (Method)] The abstract states that Transfusion combines the language modeling loss with diffusion, yet provides no specification of the loss weighting between the two objectives or the exact diffusion implementation (e.g., noise schedule, timestep embedding within the shared transformer, or how the diffusion loss is computed over image patches). This detail is load-bearing for the central claim that joint training produces no substantial optimization conflicts.
- [§4 (Experiments) and scaling plots] The scaling laws and benchmark comparisons in the experiments claim that Transfusion 'scales significantly better' than discrete image tokenization, but the manuscript reports no error bars, multiple random seeds, or ablation controls on hyperparameters such as loss weighting. Without these, the magnitude and reliability of the improvement cannot be assessed.
- [§5 (Modality-specific layers) and results tables] The performance gains from modality-specific encoding/decoding layers and the compression of each image to 16 patches are presented as key results, but the architecture, initialization, and training details of these layers are not described. These are listed as free parameters whose choices directly affect the reported cross-modal benchmarks.
minor comments (2)
- [§3 (Method)] Notation for mixed-modality sequences (e.g., how text tokens and image patches are interleaved) would benefit from an explicit example or diagram in the method section; one hypothetical layout is sketched after this list.
- [Figures in §4] Figure captions for scaling plots should explicitly state the evaluation metrics, number of runs, and what baselines are included for each curve.
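On the interleaving question raised in the first minor comment, one hypothetical layout (the BOI/EOI marker names, token ids, and patch dimension are illustrative assumptions, not taken from the manuscript):

    import torch

    BOI, EOI = "<BOI>", "<EOI>"                     # assumed boundary markers
    patches = [torch.randn(64) for _ in range(16)]  # 16 continuous patch vectors

    # One mixed-modality sequence: ints are text token ids (illustrative),
    # tensors are image patches. The LM loss would apply to the ints, the
    # diffusion loss to the tensors.
    sequence = [17, 204, 9, BOI, *patches, EOI, 31, 5]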
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We have revised the paper to address the concerns about missing implementation details, added clarifications and controls to the experiments where feasible, and expanded descriptions of the modality-specific layers to improve reproducibility and assess the reliability of our claims.
read point-by-point responses
-
Referee: [Abstract and §3 (Method)] The abstract states that Transfusion combines the language modeling loss with diffusion, yet provides no specification of the loss weighting between the two objectives or the exact diffusion implementation (e.g., noise schedule, timestep embedding within the shared transformer, or how the diffusion loss is computed over image patches). This detail is load-bearing for the central claim that joint training produces no substantial optimization conflicts.
Authors: We agree that the original submission omitted critical implementation details required for reproducibility and for evaluating potential optimization conflicts between objectives. In the revised manuscript, Section 3 now explicitly states that we use equal weighting (λ_LM = 1, λ_diff = 1) between the standard next-token cross-entropy loss and the diffusion loss. The diffusion process follows the DDPM formulation with a linear noise schedule (β from 0.0001 to 0.02 over 1000 timesteps). Timestep embeddings are generated via sinusoidal encoding and added to the image patch embeddings before the shared transformer. The diffusion loss is computed as the mean squared error on the predicted noise for each image patch independently, then averaged across patches and the batch. These additions directly support our claim of stable joint training without substantial conflicts. revision: yes
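Taken at face value, the details in this response pin down the combined objective well enough for a minimal sketch; the model interface and tensor shapes below are placeholders, not the authors' code:

    import torch
    import torch.nn.functional as F

    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)           # linear DDPM noise schedule
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)

    def sinusoidal_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
        half = dim // 2
        freqs = torch.exp(-torch.arange(half)
                          * (torch.log(torch.tensor(10000.0)) / (half - 1)))
        ang = t[:, None].float() * freqs[None, :]
        return torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1)

    def transfusion_loss(model, text_ids, patch_emb):
        """Equal-weight sum of next-token CE on text and DDPM MSE on patches."""
        # Diffusion branch: noise the image patches at a random timestep.
        B, P, D = patch_emb.shape
        t = torch.randint(0, T, (B,))
        ab = alphas_bar[t].view(B, 1, 1)
        eps = torch.randn_like(patch_emb)
        noisy = ab.sqrt() * patch_emb + (1 - ab).sqrt() * eps
        # Sinusoidal timestep embedding added to the patch embeddings.
        noisy = noisy + sinusoidal_embedding(t, D)[:, None, :]

        # Single forward pass over the mixed sequence (placeholder interface).
        logits, eps_pred = model(text_ids, noisy)

        lm = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                             text_ids[:, 1:].reshape(-1))
        diff = F.mse_loss(eps_pred, eps)   # MSE on predicted noise, averaged
        return lm + diff                   # λ_LM = λ_diff = 1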
-
Referee: [§4 (Experiments) and scaling plots] The scaling laws and benchmark comparisons in the experiments claim that Transfusion 'scales significantly better' than discrete image tokenization, but the manuscript reports no error bars, multiple random seeds, or ablation controls on hyperparameters such as loss weighting. Without these, the magnitude and reliability of the improvement cannot be assessed.
Authors: We acknowledge that the lack of error bars, multiple seeds, and hyperparameter ablations weakens the strength of the scaling claims. Due to the prohibitive cost of retraining multiple 7B-scale models, we could not rerun the largest experiments. In the revision we have added error bars (standard deviation over 3 seeds) for all models up to 1B parameters, included an appendix ablation varying the loss weighting ratio from 0.5 to 2.0 (showing the advantage persists), and softened the language from 'scales significantly better' to 'scales better' in the main text while noting the single-run limitation for the largest scales. These changes improve transparency without altering the core empirical observations. revision: partial
-
Referee: [§5 (Modality-specific layers) and results tables] The performance gains from modality-specific encoding/decoding layers and the compression of each image to 16 patches are presented as key results, but the architecture, initialization, and training details of these layers are not described. These are listed as free parameters whose choices directly affect the reported cross-modal benchmarks.
Authors: We thank the referee for highlighting this omission. The revised Section 5 now fully describes the layers: each modality-specific encoder is a 2-layer MLP (hidden size equal to model dimension, GELU activations) that projects raw image patches into the transformer embedding space; the decoder is a symmetric 2-layer MLP that maps transformer outputs to the diffusion prediction space. Both are initialized with Xavier uniform initialization and trained jointly from scratch with the shared transformer. We also specify that the 16-patch compression corresponds to a 4×4 spatial downsampling per image (with appropriate patch size adjustment) and confirm these choices were held fixed across the reported benchmarks. These details have been added to the main text and appendix. revision: yes
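A minimal sketch of the layers as described, assuming PyTorch; the class name, dimensions, and zero bias initialization are illustrative placeholders beyond what the response specifies:

    import torch.nn as nn

    class ModalityMLP(nn.Module):
        """2-layer MLP with GELU, per the revised Section 5 description."""
        def __init__(self, in_dim: int, model_dim: int, out_dim: int):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_dim, model_dim),
                nn.GELU(),
                nn.Linear(model_dim, out_dim),
            )
            for m in self.net:
                if isinstance(m, nn.Linear):
                    nn.init.xavier_uniform_(m.weight)  # Xavier uniform init
                    nn.init.zeros_(m.bias)             # assumed; not stated
        def forward(self, x):
            return self.net(x)

    d_model = 2048  # illustrative model width
    # Encoder: raw image patch -> transformer embedding space.
    encoder = ModalityMLP(in_dim=4096, model_dim=d_model, out_dim=d_model)
    # Decoder: transformer output -> diffusion (noise) prediction space.
    decoder = ModalityMLP(in_dim=d_model, model_dim=d_model, out_dim=4096)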
Circularity Check
No significant circularity
full rationale
The paper presents an empirical recipe for joint next-token prediction and diffusion training in a single transformer, with results from direct pretraining runs up to 7B parameters on 2T multi-modal tokens. Scaling laws and performance comparisons (including to quantized-image LM baselines and standalone diffusion models) are established experimentally rather than derived from equations or definitions that reduce to the inputs by construction. No self-definitional steps, fitted parameters renamed as predictions, or load-bearing self-citation chains appear in the methodology or claims. The work is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- loss weighting between next-token and diffusion objectives
- modality-specific layer architecture and initialization
axioms (1)
- Domain assumption: Diffusion can be applied directly to image patches within a shared transformer sequence without architectural incompatibility.
Forward citations
Cited by 26 Pith papers
-
Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation
INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.
-
Modular Sensory Stream for Integrating Physical Feedback in Vision-Language-Action Models
MoSS augments VLAs with decoupled modality streams for multiple physical signals, achieving synergistic gains in real-world robot tasks via joint attention and auxiliary future-signal prediction.
-
LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling
LangFlow is the first continuous diffusion language model to rival discrete diffusion on perplexity and generative perplexity while exceeding autoregressive baselines on several zero-shot tasks.
-
Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning
Process-driven image generation decomposes text-to-image synthesis into interleaved cycles of textual planning, visual drafting, textual reflection, and visual refinement with dense consistency supervision.
-
STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation
STARFlow2 presents an autoregressive flow-based architecture for unified multimodal text-image generation by interleaving a VLM stream with a TarFlow stream via residual skips and a unified latent space.
-
MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality
MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.
-
Meta-CoT: Enhancing Granularity and Generalization in Image Editing
Meta-CoT uses two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.
-
CheXmix: Unified Generative Pretraining for Vision Language Models in Medical Imaging
CheXmix combines masked autoencoder pretraining with early-fusion generative modeling to outperform prior models on chest X-ray classification by up to 8.6% AUROC, inpainting by 51%, and report generation by 45% on GREEN.
-
Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs
IMU-to-4D uses wearable IMU data and repurposed LLMs to predict coherent 4D human motion plus coarse scene structure, outperforming cascaded state-of-the-art pipelines in temporal stability.
-
Generative Refinement Networks for Visual Synthesis
GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.
-
Counting to Four is still a Chore for VLMs
VLMs fail at counting because visual evidence degrades in later language layers, and a lightweight Modality Attention Share intervention can encourage better use of image information during answer generation.
-
Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding
Symbiotic-MoE introduces modality-aware expert disentanglement and progressive training in a multimodal MoE to achieve synergistic generation and understanding without task interference or extra parameters.
-
CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models
CLEAR uses degradation-aware fine-tuning, a latent representation bridge, and interleaved reinforcement learning to connect generative and reasoning capabilities in multimodal models for better degraded image understanding.
-
Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets
Unified World Models couple video and action diffusion inside one transformer with independent timesteps, enabling pretraining on heterogeneous robot datasets that include action-free video and producing more generali...
-
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
π₀ is a vision-language-action flow model trained on diverse multi-platform robot data that supports zero-shot task performance, language instruction following, and efficient fine-tuning for dexterous tasks.
-
Emu3: Next-Token Prediction is All You Need
Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.
-
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
-
Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation
Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.
-
WorldVLA: Towards Autoregressive Action World Model
WorldVLA unifies VLA and world models in one autoregressive system, shows they boost each other, and adds an attention mask to stop error buildup when generating action chunks.
-
Emerging Properties in Unified Multimodal Pretraining
BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.
-
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
BLIP3-o uses a diffusion transformer to generate CLIP image features and a sequential pretraining strategy to build open models that perform strongly on both image understanding and generation benchmarks.
-
Hallucination of Multimodal Large Language Models: A Survey
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
-
MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings
MMCORE transfers VLM reasoning into diffusion-based image generation and editing via aligned latent embeddings from learnable queries, outperforming baselines on text-to-image and editing tasks.
-
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.
-
Cosmos World Foundation Model Platform for Physical AI
The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.
-
Redefining End-of-Life: Intelligent Automation for Electronics Remanufacturing Systems
A literature review of intelligent automation approaches using robotics, AI, and control for disassembly, inspection, sorting, and reprocessing of end-of-life electronics.
Reference graph
Works this paper leans on
-
[1]
Lumiere: A Space-Time Diffusion Model for Video Generation
Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Yuanzhen Li, Tomer Michaeli, et al. arXiv preprint arXiv:2401.12945.
-
[2]
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Chameleon Team. arXiv preprint arXiv:2405.09818.
-
[4]
Improved Baselines with Momentum Contrastive Learning
Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. arXiv preprint arXiv:2003.04297.
-
[5]
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. arXiv preprint arXiv:1905.10044.
-
[6]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. arXiv preprint arXiv:1803.05457.
-
[7]
Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack
Xiaoliang Dai, Ji Hou, Chih-Yao Ma, Sam Tsai, Jialiang Wang, Rui Wang, Peizhao Zhang, Simon Vandenhende, Xiaofang Wang, Abhimanyu Dubey, et al. arXiv preprint arXiv:2309.15807.
-
[8]
DreamLLM: Synergistic Multimodal Comprehension and Creation
Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, et al. arXiv preprint arXiv:2309.11499.
-
[9]
Taming Transformers for High-Resolution Image Synthesis
Patrick Esser, Robin Rombach, and Björn Ommer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12873–12883.
-
[10]
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. In Forty-first International Conference on Machine Learning, 2024.
-
[11]
Classifier-Free Diffusion Guidance
Jonathan Ho and Tim Salimans. arXiv preprint arXiv:2207.12598.
-
[12]
Auto-Encoding Variational Bayes
Diederik P. Kingma and Max Welling. arXiv preprint arXiv:1312.6114.
-
[13]
Diffusion-LM Improves Controllable Text Generation
Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy Liang, and Tatsunori Hashimoto. arXiv preprint arXiv:2205.14217.
-
[14]
Flow Matching for Generative Modeling
Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. arXiv preprint arXiv:2210.02747.
-
[15]
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan Zou, Jianwei Yang, Hang Su, Jun Zhu, et al. arXiv preprint arXiv:2311.05437.
-
[16]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. arXiv preprint arXiv:2307.01952.
-
[17]
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. arXiv preprint arXiv:2103.00020.
-
[19]
Zero-Shot Text-to-Image Generation
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. In International Conference on Machine Learning, pages 8821–8831. PMLR.
-
[20]
SocialIQA: Commonsense Reasoning about Social Interactions
Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. arXiv preprint arXiv:1904.09728.
-
[21]
GLU Variants Improve Transformer
Noam Shazeer. arXiv preprint arXiv:2002.05202.
-
[22]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. arXiv preprint arXiv:2302.13971.
-
[23]
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, et al. arXiv preprint arXiv:2206.10789.
-
[24]
Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning
Lili Yu, Bowen Shi, Ramakanth Pasunuru, Benjamin Muller, Olga Golovneva, Tianlu Wang, Arun Babu, Binh Tang, Brian Karrer, Shelly Sheynin, et al. arXiv preprint arXiv:2309.02591.
-
[25]
HellaSwag: Can a Machine Really Finish Your Sentence?
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019).
-
[26]
Autoencoder Details (Appendix A)
The training objective for the VAE closely follows that of Esser et al. [2021]: L_VAE = L_1 + L_LPIPS + 0.5 L_GAN + 0.2 L_ID + 0.000001 L_KL, where L_1 is an L1 loss in pixel space, L_LPIPS is a perceptual loss based on LPIPS similarity [Zhang et al., 2018], L_GAN is a patch-based discriminator loss, and L_ID is a perceptual loss based on internal features...
-
[27]
VQ-GAN Details (Appendix A)
The training objective for the VQ-GAN matches that of the VAE, with one notable exception: we replace the L_KL loss with the standard codebook commitment loss L_codebook [Van Den Oord et al., 2017], which encourages encoder outputs and codebook vectors to be close together. We use β = 0.25 and a loss weighting of 1.0. The final loss function for the VQ-VAE...