Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Pith reviewed 2026-05-13 05:54 UTC · model grok-4.3
The pith
Transfusion trains one transformer on mixed text and image sequences by combining next-token prediction with diffusion losses.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Transfusion combines the language modeling loss function with diffusion to train a single transformer over mixed-modality sequences, establishing scaling laws and reaching performance on par with separately trained language models and diffusion models when scaled to 7B parameters and 2T tokens.
What carries the argument
Joint optimization of next-token language modeling loss on text and diffusion loss on image patches inside one transformer, optionally augmented by modality-specific encoding and decoding layers.
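In symbols, a minimal reading of that recipe (the balancing coefficient λ and the noise-prediction parameterization are illustrative assumptions; the authors' concrete choices are discussed in the rebuttal below):

    \mathcal{L}_{\text{Transfusion}} = \mathcal{L}_{\text{LM}} + \lambda \, \mathcal{L}_{\text{DDPM}}, \qquad
    \mathcal{L}_{\text{LM}} = -\textstyle\sum_i \log p_\theta(y_i \mid y_{<i}), \qquad
    \mathcal{L}_{\text{DDPM}} = \mathbb{E}_{t,\epsilon}\left[ \lVert \epsilon - \epsilon_\theta(x_t, t) \rVert_2^2 \right]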
If this is right
- The combined loss scales better than language modeling over quantized image tokens across uni-modal and cross-modal tasks.
- Modality-specific encoding and decoding layers improve performance and permit extreme image compression to 16 patches.
- At 7B parameters the single model matches the generation quality of specialized diffusion models for images and language models for text.
- Training on 2T mixed multi-modal tokens produces competitive results without separate modality-specific architectures.
Where Pith is reading between the lines
- The same joint-loss recipe could be tested on additional continuous modalities such as audio by swapping the diffusion component.
- Unified training may reduce the engineering overhead of maintaining separate text and image model families for downstream applications.
- The observed lack of interference at current scales invites direct measurement of gradient alignment between the two loss terms during training.
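On the last point, a hedged sketch of what such a measurement could look like; model, lm_loss, and diff_loss are hypothetical stand-ins, not artifacts from the paper:

    import torch

    def grad_cosine(model: torch.nn.Module,
                    lm_loss: torch.Tensor,
                    diff_loss: torch.Tensor) -> float:
        """Cosine similarity between the two loss gradients over shared weights.

        Values near -1 would indicate the negative interference the premise
        rules out; values near 0 or above suggest the objectives coexist.
        """
        params = [p for p in model.parameters() if p.requires_grad]
        g_lm = torch.autograd.grad(lm_loss, params, retain_graph=True,
                                   allow_unused=True)
        g_df = torch.autograd.grad(diff_loss, params, retain_graph=True,
                                   allow_unused=True)
        dot, n_lm, n_df = 0.0, 0.0, 0.0
        for a, b in zip(g_lm, g_df):
            if a is None or b is None:  # parameter touched by only one objective
                continue
            dot += (a * b).sum().item()
            n_lm += a.pow(2).sum().item()
            n_df += b.pow(2).sum().item()
        return dot / ((n_lm ** 0.5) * (n_df ** 0.5) + 1e-12)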
Load-bearing premise
That the language modeling and diffusion objectives can be optimized together in the same transformer without substantial conflicts or negative interference between the two modalities.
What would settle it
A controlled scaling run at 7B parameters or larger in which the Transfusion model falls noticeably behind matched-scale separate language and diffusion models on standard text and image generation benchmarks.
read the original abstract
We introduce Transfusion, a recipe for training a multi-modal model over discrete and continuous data. Transfusion combines the language modeling loss function (next token prediction) with diffusion to train a single transformer over mixed-modality sequences. We pretrain multiple Transfusion models up to 7B parameters from scratch on a mixture of text and image data, establishing scaling laws with respect to a variety of uni- and cross-modal benchmarks. Our experiments show that Transfusion scales significantly better than quantizing images and training a language model over discrete image tokens. By introducing modality-specific encoding and decoding layers, we can further improve the performance of Transfusion models, and even compress each image to just 16 patches. We further demonstrate that scaling our Transfusion recipe to 7B parameters and 2T multi-modal tokens produces a model that can generate images and text on a par with similar scale diffusion models and language models, reaping the benefits of both worlds.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Transfusion, a training recipe for a single transformer model that processes mixed sequences of discrete text tokens and continuous image data. It combines the standard language modeling loss (next-token prediction) with a diffusion objective for images. The authors pretrain models scaling up to 7B parameters on 2 trillion multi-modal tokens, establish empirical scaling laws across uni- and cross-modal tasks, and show that Transfusion outperforms approaches that quantize images into discrete tokens. They further introduce modality-specific encoding and decoding layers that allow compressing each image to only 16 patches while improving performance, and demonstrate that the 7B model generates text and images competitively with specialized models of similar scale.
Significance. If the reported results hold under full experimental disclosure, this work provides empirical support for a unified multi-modal architecture that jointly trains on discrete and continuous data without separate modality backbones. The scaling observations up to 7B/2T tokens and the competitive generation quality against specialized diffusion and language models would be a useful data point for the community, particularly the reported ability to compress images to 16 patches via modality-specific layers.
major comments (3)
- [Abstract and §3 (Method)] The abstract states that Transfusion combines the language modeling loss with diffusion, yet provides no specification of the loss weighting between the two objectives or the exact diffusion implementation (e.g., noise schedule, timestep embedding within the shared transformer, or how the diffusion loss is computed over image patches). This detail is load-bearing for the central claim that joint training produces no substantial optimization conflicts.
- [§4 (Experiments) and scaling plots] The scaling laws and benchmark comparisons in the experiments claim that Transfusion 'scales significantly better' than discrete image tokenization, but the manuscript reports no error bars, multiple random seeds, or ablation controls on hyperparameters such as loss weighting. Without these, the magnitude and reliability of the improvement cannot be assessed.
- [§5 (Modality-specific layers) and results tables] The performance gains from modality-specific encoding/decoding layers and the compression of each image to 16 patches are presented as key results, but the architecture, initialization, and training details of these layers are not described. These are listed as free parameters whose choices directly affect the reported cross-modal benchmarks.
minor comments (2)
- [§3 (Method)] Notation for mixed-modality sequences (e.g., how text tokens and image patches are interleaved) would benefit from an explicit example or diagram in the method section; one hypothetical layout is sketched after this list.
- [Figures in §4] Figure captions for scaling plots should explicitly state the evaluation metrics, number of runs, and what baselines are included for each curve.
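On the interleaving question raised in the first minor comment, one hypothetical layout (the BOI/EOI marker names, token ids, and patch dimension are illustrative assumptions, not taken from the manuscript):

    import torch

    BOI, EOI = "<BOI>", "<EOI>"                     # assumed boundary markers
    patches = [torch.randn(64) for _ in range(16)]  # 16 continuous patch vectors

    # One mixed-modality sequence: ints are text token ids (illustrative),
    # tensors are image patches. The LM loss would apply to the ints, the
    # diffusion loss to the tensors.
    sequence = [17, 204, 9, BOI, *patches, EOI, 31, 5]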
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We have revised the paper to address the concerns about missing implementation details, added clarifications and controls to the experiments where feasible, and expanded descriptions of the modality-specific layers to improve reproducibility and assess the reliability of our claims.
read point-by-point responses
-
Referee: [Abstract and §3 (Method)] The abstract states that Transfusion combines the language modeling loss with diffusion, yet provides no specification of the loss weighting between the two objectives or the exact diffusion implementation (e.g., noise schedule, timestep embedding within the shared transformer, or how the diffusion loss is computed over image patches). This detail is load-bearing for the central claim that joint training produces no substantial optimization conflicts.
Authors: We agree that the original submission omitted critical implementation details required for reproducibility and for evaluating potential optimization conflicts between objectives. In the revised manuscript, Section 3 now explicitly states that we use equal weighting (λ_LM = 1, λ_diff = 1) between the standard next-token cross-entropy loss and the diffusion loss. The diffusion process follows the DDPM formulation with a linear noise schedule (β from 0.0001 to 0.02 over 1000 timesteps). Timestep embeddings are generated via sinusoidal encoding and added to the image patch embeddings before the shared transformer. The diffusion loss is computed as the mean squared error on the predicted noise for each image patch independently, then averaged across patches and the batch. These additions directly support our claim of stable joint training without substantial conflicts. revision: yes
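Taken at face value, the details in this response pin down the combined objective well enough for a minimal sketch; the model interface and tensor shapes below are placeholders, not the authors' code:

    import torch
    import torch.nn.functional as F

    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)           # linear DDPM noise schedule
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)

    def sinusoidal_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
        half = dim // 2
        freqs = torch.exp(-torch.arange(half)
                          * (torch.log(torch.tensor(10000.0)) / (half - 1)))
        ang = t[:, None].float() * freqs[None, :]
        return torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1)

    def transfusion_loss(model, text_ids, patch_emb):
        """Equal-weight sum of next-token CE on text and DDPM MSE on patches."""
        # Diffusion branch: noise the image patches at a random timestep.
        B, P, D = patch_emb.shape
        t = torch.randint(0, T, (B,))
        ab = alphas_bar[t].view(B, 1, 1)
        eps = torch.randn_like(patch_emb)
        noisy = ab.sqrt() * patch_emb + (1 - ab).sqrt() * eps
        # Sinusoidal timestep embedding added to the patch embeddings.
        noisy = noisy + sinusoidal_embedding(t, D)[:, None, :]

        # Single forward pass over the mixed sequence (placeholder interface).
        logits, eps_pred = model(text_ids, noisy)

        lm = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                             text_ids[:, 1:].reshape(-1))
        diff = F.mse_loss(eps_pred, eps)   # MSE on predicted noise, averaged
        return lm + diff                   # λ_LM = λ_diff = 1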
-
Referee: [§4 (Experiments) and scaling plots] The scaling laws and benchmark comparisons in the experiments claim that Transfusion 'scales significantly better' than discrete image tokenization, but the manuscript reports no error bars, multiple random seeds, or ablation controls on hyperparameters such as loss weighting. Without these, the magnitude and reliability of the improvement cannot be assessed.
Authors: We acknowledge that the lack of error bars, multiple seeds, and hyperparameter ablations weakens the strength of the scaling claims. Due to the prohibitive cost of retraining multiple 7B-scale models, we could not rerun the largest experiments. In the revision we have added error bars (standard deviation over 3 seeds) for all models up to 1B parameters, included an appendix ablation varying the loss weighting ratio from 0.5 to 2.0 (showing the advantage persists), and softened the language from 'scales significantly better' to 'scales better' in the main text while noting the single-run limitation for the largest scales. These changes improve transparency without altering the core empirical observations. revision: partial
-
Referee: [§5 (Modality-specific layers) and results tables] The performance gains from modality-specific encoding/decoding layers and the compression of each image to 16 patches are presented as key results, but the architecture, initialization, and training details of these layers are not described. These are listed as free parameters whose choices directly affect the reported cross-modal benchmarks.
Authors: We thank the referee for highlighting this omission. The revised Section 5 now fully describes the layers: each modality-specific encoder is a 2-layer MLP (hidden size equal to model dimension, GELU activations) that projects raw image patches into the transformer embedding space; the decoder is a symmetric 2-layer MLP that maps transformer outputs to the diffusion prediction space. Both are initialized with Xavier uniform initialization and trained jointly from scratch with the shared transformer. We also specify that the 16-patch compression corresponds to a 4×4 spatial downsampling per image (with appropriate patch size adjustment) and confirm these choices were held fixed across the reported benchmarks. These details have been added to the main text and appendix. revision: yes
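A minimal sketch of the layers as described, assuming PyTorch; the class name, dimensions, and zero bias initialization are illustrative placeholders beyond what the response specifies:

    import torch.nn as nn

    class ModalityMLP(nn.Module):
        """2-layer MLP with GELU, per the revised Section 5 description."""
        def __init__(self, in_dim: int, model_dim: int, out_dim: int):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_dim, model_dim),
                nn.GELU(),
                nn.Linear(model_dim, out_dim),
            )
            for m in self.net:
                if isinstance(m, nn.Linear):
                    nn.init.xavier_uniform_(m.weight)  # Xavier uniform init
                    nn.init.zeros_(m.bias)             # assumed; not stated
        def forward(self, x):
            return self.net(x)

    d_model = 2048  # illustrative model width
    # Encoder: raw image patch -> transformer embedding space.
    encoder = ModalityMLP(in_dim=4096, model_dim=d_model, out_dim=d_model)
    # Decoder: transformer output -> diffusion (noise) prediction space.
    decoder = ModalityMLP(in_dim=d_model, model_dim=d_model, out_dim=4096)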
Circularity Check
No significant circularity
full rationale
The paper presents an empirical recipe for joint next-token prediction and diffusion training in a single transformer, with results from direct pretraining runs up to 7B parameters on 2T multi-modal tokens. Scaling laws and performance comparisons (including to quantized-image LM baselines and standalone diffusion models) are established experimentally rather than derived from equations or definitions that reduce to the inputs by construction. No self-definitional steps, fitted parameters renamed as predictions, or load-bearing self-citation chains appear in the methodology or claims. The work is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- loss weighting between next-token and diffusion objectives
- modality-specific layer architecture and initialization
axioms (1)
- Domain assumption: Diffusion can be applied directly to image patches within a shared transformer sequence without architectural incompatibility.
Forward citations
Cited by 26 Pith papers
-
Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation
INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.
-
Modular Sensory Stream for Integrating Physical Feedback in Vision-Language-Action Models
MoSS augments VLAs with decoupled modality streams for multiple physical signals, achieving synergistic gains in real-world robot tasks via joint attention and auxiliary future-signal prediction.
-
LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling
LangFlow is the first continuous diffusion language model to rival discrete diffusion on perplexity and generative perplexity while exceeding autoregressive baselines on several zero-shot tasks.
-
Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning
Process-driven image generation decomposes text-to-image synthesis into interleaved cycles of textual planning, visual drafting, textual reflection, and visual refinement with dense consistency supervision.
-
STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation
STARFlow2 presents an autoregressive flow-based architecture for unified multimodal text-image generation by interleaving a VLM stream with a TarFlow stream via residual skips and a unified latent space.
-
MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality
MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.
-
Meta-CoT: Enhancing Granularity and Generalization in Image Editing
Meta-CoT uses two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.
-
CheXmix: Unified Generative Pretraining for Vision Language Models in Medical Imaging
CheXmix combines masked autoencoder pretraining with early-fusion generative modeling to outperform prior models on chest X-ray classification by up to 8.6% AUROC, inpainting by 51%, and report generation by 45% on GREEN.
-
Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs
IMU-to-4D uses wearable IMU data and repurposed LLMs to predict coherent 4D human motion plus coarse scene structure, outperforming cascaded state-of-the-art pipelines in temporal stability.
-
Generative Refinement Networks for Visual Synthesis
GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.
-
Counting to Four is still a Chore for VLMs
VLMs fail at counting because visual evidence degrades in later language layers, and a lightweight Modality Attention Share intervention can encourage better use of image information during answer generation.
-
Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding
Symbiotic-MoE introduces modality-aware expert disentanglement and progressive training in a multimodal MoE to achieve synergistic generation and understanding without task interference or extra parameters.
-
CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models
CLEAR uses degradation-aware fine-tuning, a latent representation bridge, and interleaved reinforcement learning to connect generative and reasoning capabilities in multimodal models for better degraded image understanding.
-
Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets
Unified World Models couple video and action diffusion inside one transformer with independent timesteps, enabling pretraining on heterogeneous robot datasets that include action-free video and producing more generali...
-
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
π₀ is a vision-language-action flow model trained on diverse multi-platform robot data that supports zero-shot task performance, language instruction following, and efficient fine-tuning for dexterous tasks.
-
Emu3: Next-Token Prediction is All You Need
Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.
-
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
-
Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation
Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.
-
WorldVLA: Towards Autoregressive Action World Model
WorldVLA unifies VLA and world models in one autoregressive system, shows they boost each other, and adds an attention mask to stop error buildup when generating action chunks.
-
Emerging Properties in Unified Multimodal Pretraining
BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.
-
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
BLIP3-o uses a diffusion transformer to generate CLIP image features and a sequential pretraining strategy to build open models that perform strongly on both image understanding and generation benchmarks.
-
Hallucination of Multimodal Large Language Models: A Survey
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
-
MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings
MMCORE transfers VLM reasoning into diffusion-based image generation and editing via aligned latent embeddings from learnable queries, outperforming baselines on text-to-image and editing tasks.
-
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.
-
Cosmos World Foundation Model Platform for Physical AI
The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.
-
Redefining End-of-Life: Intelligent Automation for Electronics Remanufacturing Systems
A literature review of intelligent automation approaches using robotics, AI, and control for disassembly, inspection, sorting, and reprocessing of end-of-life electronics.
Reference graph
Works this paper leans on
-
[1]
Lumiere: A Space-Time Diffusion Model for Video Generation
Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Yuanzhen Li, Tomer Michaeli, et al. arXiv preprint arXiv:2401.12945.
-
[2]
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Chameleon Team. arXiv preprint arXiv:2405.09818.
-
[4]
Improved Baselines with Momentum Contrastive Learning
Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. arXiv preprint arXiv:2003.04297.
-
[5]
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. arXiv preprint arXiv:1905.10044.
-
[6]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. arXiv preprint arXiv:1803.05457.
-
[7]
Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack
Xiaoliang Dai, Ji Hou, Chih-Yao Ma, Sam Tsai, Jialiang Wang, Rui Wang, Peizhao Zhang, Simon Vandenhende, Xiaofang Wang, Abhimanyu Dubey, et al. arXiv preprint arXiv:2309.15807.
-
[8]
DreamLLM: Synergistic Multimodal Comprehension and Creation
Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, et al. arXiv preprint arXiv:2309.11499.
-
[9]
Taming Transformers for High-Resolution Image Synthesis
Patrick Esser, Robin Rombach, and Björn Ommer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12873–12883.
-
[10]
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. In Forty-first International Conference on Machine Learning, 2024.
-
[11]
Classifier-Free Diffusion Guidance
Jonathan Ho and Tim Salimans. arXiv preprint arXiv:2207.12598.
-
[12]
Auto-Encoding Variational Bayes
Diederik P. Kingma and Max Welling. arXiv preprint arXiv:1312.6114.
-
[13]
Diffusion-LM Improves Controllable Text Generation
Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy Liang, and Tatsunori Hashimoto. arXiv preprint arXiv:2205.14217.
-
[14]
Flow Matching for Generative Modeling
Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. arXiv preprint arXiv:2210.02747.
-
[15]
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan Zou, Jianwei Yang, Hang Su, Jun Zhu, et al. arXiv preprint arXiv:2311.05437.
-
[16]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. arXiv preprint arXiv:2307.01952.
-
[17]
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. arXiv preprint arXiv:2103.00020.
-
[19]
Zero-Shot Text-to-Image Generation
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. In International Conference on Machine Learning, pages 8821–8831. PMLR.
-
[20]
SocialIQA: Commonsense Reasoning about Social Interactions
Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. arXiv preprint arXiv:1904.09728.
-
[21]
GLU Variants Improve Transformer
Noam Shazeer. arXiv preprint arXiv:2002.05202.
-
[22]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. arXiv preprint arXiv:2302.13971.
-
[23]
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, et al. arXiv preprint arXiv:2206.10789.
-
[24]
Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning
Lili Yu, Bowen Shi, Ramakanth Pasunuru, Benjamin Muller, Olga Golovneva, Tianlu Wang, Arun Babu, Binh Tang, Brian Karrer, Shelly Sheynin, et al. arXiv preprint arXiv:2309.02591.
-
[25]
HellaSwag: Can a Machine Really Finish Your Sentence?
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019).
-
[26]
Autoencoder Details (Appendix A)
The training objective for the VAE closely follows that of Esser et al. [2021]: L_VAE = L_1 + L_LPIPS + 0.5 L_GAN + 0.2 L_ID + 0.000001 L_KL, where L_1 is an L1 loss in pixel space, L_LPIPS is a perceptual loss based on LPIPS similarity [Zhang et al., 2018], L_GAN is a patch-based discriminator loss, and L_ID is a perceptual loss based on internal features...
-
[27]
VQ-GAN Details (Appendix A)
The training objective for the VQ-GAN matches that of the VAE, with one notable exception: we replace the L_KL loss with the standard codebook commitment loss L_codebook [Van Den Oord et al., 2017], which encourages encoder outputs and codebook vectors to be close together. We use β = 0.25 and a loss weighting of 1.0. The final loss function for the VQ-VAE...