pith. machine review for the scientific record.

arxiv: 2604.19902 · v1 · submitted 2026-04-21 · 💻 cs.CV · cs.AI · cs.LG

Recognition: unknown

MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 03:23 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords multimodal image generation · vision-language models · diffusion models · image editing · learnable query tokens · text-to-image synthesis · semantic embeddings

The pith

Learnable query tokens in a frozen vision-language model extract semantic embeddings that condition a diffusion model for multimodal image generation and editing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method that uses a pre-trained vision-language model to produce conditioning signals for a diffusion-based generator. Learnable query tokens pull out semantic visual information without retraining the VLM or deeply fusing the understanding and generation models. This design aims to carry the model's ability to reason about text, images, and spatial relations over into the generation process. The result is a simpler pipeline that the authors show works on text-to-image creation as well as single- and multi-image editing tasks. A reader would care because it lowers the cost of combining understanding models with synthesis while claiming better results than prior approaches on standard benchmarks.

Core claim

MMCORE shows that semantic visual embeddings predicted by learnable query tokens inside a frozen VLM can serve directly as conditioning signals for a diffusion model, enabling a single framework to handle both text-to-image synthesis and interleaved image editing while preserving high fidelity and reducing the need for deep fusion or training from scratch.

What carries the argument

Learnable query tokens that extract aligned semantic visual embeddings from a frozen VLM to condition the diffusion model.
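
The paper's code is not reproduced here; purely as a non-authoritative illustration of the mechanism the sentence above describes, the sketch below appends learnable query tokens to a frozen VLM's input sequence and projects the hidden states at the query positions into a diffusion model's conditioning space. The module name, dimensions, and the two-layer projection are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class QueryTokenConnector(nn.Module):
    """Hypothetical sketch: learnable queries read semantics out of a frozen VLM.

    Assumptions (not from the paper): the VLM accepts `inputs_embeds` and returns
    `last_hidden_state`; the diffusion model consumes a sequence of conditioning
    embeddings, e.g. in place of a text-encoder output.
    """

    def __init__(self, vlm, num_queries=64, vlm_dim=4096, cond_dim=2048):
        super().__init__()
        self.vlm = vlm.eval()                        # frozen understanding model
        for p in self.vlm.parameters():
            p.requires_grad_(False)
        self.queries = nn.Parameter(torch.randn(num_queries, vlm_dim) * 0.02)
        self.proj = nn.Sequential(                   # aligns VLM space -> diffusion cond space
            nn.Linear(vlm_dim, cond_dim),
            nn.GELU(),
            nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, prompt_embeds):
        """prompt_embeds: (B, L, vlm_dim) embeddings of the interleaved text/image prompt."""
        B = prompt_embeds.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        seq = torch.cat([prompt_embeds, q], dim=1)   # queries attend to the prompt inside the VLM
        hidden = self.vlm(inputs_embeds=seq).last_hidden_state
        query_out = hidden[:, -q.shape[1]:, :]       # keep only the query positions
        return self.proj(query_out)                  # (B, num_queries, cond_dim) conditioning
```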

If this is right

  • The same conditioning pathway supports both pure text-to-image generation and editing operations that interleave reference images (see the sketch after this list).
  • Complex tasks such as spatial reasoning and visual grounding become tractable without separate training stages.
  • Computational cost drops because the VLM stays frozen and no deep autoregressive-diffusion fusion is required.
  • Performance exceeds prior baselines on a range of text-to-image and single- or multi-image editing benchmarks.
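
Continuing the illustrative sketch above, and again with every interface assumed rather than taken from the paper, the same connector can be driven by an interleaved prompt: text embeddings alone for text-to-image, or text plus one or more reference-image embeddings for editing.

```python
import torch

def build_prompt_embeds(vlm, tokenizer, image_encoder, text, images=()):
    """Hypothetical helper (interfaces assumed, not the paper's code): interleave
    text-token embeddings with visual-token embeddings from reference images, so
    text-to-image (images=()) and single/multi-image editing share one pathway."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    parts = [vlm.get_input_embeddings()(ids)]               # (1, L_text, vlm_dim)
    for img in images:                                       # zero, one, or many references
        parts.append(image_encoder(img))                     # (1, L_img, vlm_dim) visual tokens
    return torch.cat(parts, dim=1)

# connector = QueryTokenConnector(vlm)                       # from the sketch above
# cond_t2i  = connector(build_prompt_embeds(vlm, tok, vis, "a red cube left of a blue ball"))
# cond_edit = connector(build_prompt_embeds(vlm, tok, vis, "make the sky stormy", images=[ref]))
```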

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Future improvements to the underlying VLM could be plugged in directly to upgrade generation quality without retraining the diffusion component.
  • The query-token approach might extend to conditioning other generators, such as video or 3D diffusion models, using the same frozen VLM.
  • Training data efficiency could rise if the VLM's pre-existing knowledge reduces the volume of image-text pairs needed for the diffusion stage.

Load-bearing premise

The embeddings produced by the learnable query tokens contain enough semantic detail to guide the diffusion model correctly through complex spatial and visual-grounding cases.

What would settle it

A collection of spatial-reasoning or visual-grounding prompts where the generated images systematically fail to respect the intended object relations or scene layout despite the VLM correctly describing those relations.
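
As a rough illustration of how such a probe could be assembled (none of these helpers come from the paper), one could pair each spatial prompt with a yes/no question and score generations with an independent VQA model:

```python
def spatial_probe(generate, vqa, prompts):
    """Hypothetical harness; `generate` and `vqa` are placeholders, not tools the
    paper provides. Systematic failures here, on relations the VLM itself
    describes correctly, would be the kind of evidence that settles the premise."""
    failures = []
    for p in prompts:
        img = generate(p["prompt"])                  # e.g. "a cup to the left of a laptop"
        answer = vqa(img, p["question"])             # e.g. "Is the cup to the left of the laptop?"
        if answer.strip().lower() != p["expected"]:  # expected answer, e.g. "yes"
            failures.append(p["prompt"])
    return failures

# prompts = [{"prompt": "a cup to the left of a laptop",
#             "question": "Is the cup to the left of the laptop?",
#             "expected": "yes"}]
```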

read the original abstract

We present MMCORE, a unified framework designed for multimodal image generation and editing. MMCORE leverages a pre-trained Vision-Language Model (VLM) to predict semantic visual embeddings via learnable query tokens, which subsequently serve as conditioning signals for a diffusion model. This streamlined design effectively transfers the rich understanding and reasoning capabilities of VLMs into the visual generation process. By obviating the need for deep fusion between autoregressive and diffusion models or training from scratch, MMCORE significantly reduces computational overhead while maintaining high-fidelity synthesis. MMCORE seamlessly integrates text-to-image synthesis with interleaved image generation, demonstrating robust multimodal comprehension in complex scenarios such as spatial reasoning and visual grounding. Comprehensive evaluations indicate that MMCORE consistently outperforms state-of-the-art baselines across a broad spectrum of text-to-image and single/multi-image editing benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper introduces MMCORE, a unified framework for multimodal image generation and editing. It uses a frozen pre-trained Vision-Language Model (VLM) with learnable query tokens to extract semantic visual embeddings that condition a diffusion model. This enables text-to-image synthesis, interleaved image generation, and single/multi-image editing tasks involving spatial reasoning and visual grounding. The design avoids deep fusion between autoregressive and diffusion components or training from scratch, with claims of reduced computational overhead and consistent outperformance over state-of-the-art baselines on relevant benchmarks.

Significance. If the empirical results hold under detailed scrutiny, the work illustrates that lightweight, query-token-based alignment can transfer VLM semantic and reasoning capabilities to diffusion models effectively. This streamlined approach offers a computationally lighter alternative to complex multimodal fusion architectures, with potential practical value for high-fidelity generation and editing. The focus on benchmark-driven evaluation provides a reproducible basis for comparison, which is a positive aspect of the contribution.

major comments (1)
  1. [Evaluation section] The central claim of consistent outperformance on text-to-image and single/multi-image editing benchmarks is not supported by any quantitative results, tables, specific metrics (e.g., FID, CLIP scores), baseline implementations, dataset details, or experimental controls. Without this evidence, the primary empirical assertion cannot be assessed or reproduced.
minor comments (3)
  1. [Method] The method description would benefit from explicit equations or pseudocode detailing the optimization of the learnable query tokens and the precise mechanism for aligning VLM embeddings to the diffusion model's conditioning space (a hedged illustration of what such a training step could look like follows this list).
  2. [Method] Notation for embeddings, query tokens, and conditioning signals should be introduced consistently and defined upon first use to improve readability.
  3. [Abstract] The abstract could briefly reference key benchmark names or metrics to give readers immediate context for the claimed improvements.
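
The manuscript does not include such pseudocode; as a hedged sketch of what the referee is asking for, one plausible training step is a standard denoising objective in which gradients reach only the learnable queries, the projection, and the generator, never the frozen VLM. Every interface and hyperparameter below is an assumption.

```python
import torch
import torch.nn.functional as F

def training_step(connector, diffusion_model, vae, batch, optimizer):
    """Hedged sketch only: a standard epsilon-prediction diffusion loss in which
    gradients flow into the learnable queries, the projection, and the generator,
    while the VLM inside `connector` stays frozen. Interfaces are assumptions."""
    cond = connector(batch["prompt_embeds"])               # (B, Q, cond_dim)
    with torch.no_grad():
        latents = vae.encode(batch["target_images"])       # clean latents of the target image
    noise = torch.randn_like(latents)
    t = torch.randint(0, 1000, (latents.shape[0],), device=latents.device)
    noisy = diffusion_model.add_noise(latents, noise, t)   # forward process q(x_t | x_0)
    pred = diffusion_model(noisy, t, encoder_hidden_states=cond)
    loss = F.mse_loss(pred, noise)                         # L = E || eps_hat - eps ||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```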

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and constructive feedback. We address the major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Evaluation section] The central claim of consistent outperformance on text-to-image and single/multi-image editing benchmarks is not supported by any quantitative results, tables, specific metrics (e.g., FID, CLIP scores), baseline implementations, dataset details, or experimental controls. Without this evidence, the primary empirical assertion cannot be assessed or reproduced.

    Authors: We agree that the submitted manuscript's evaluation section does not contain the required quantitative results, tables, metrics such as FID or CLIP scores, baseline details, dataset information, or experimental controls. This omission prevents proper assessment of the outperformance claims. In the revised version, we will add a complete evaluation section with all of these elements, including specific numbers, tables, and reproducibility details to support the claims made in the abstract and introduction.

    revision: yes
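
For context on what the promised numbers involve, standard tooling such as torchmetrics can compute FID and CLIP score; the snippet below illustrates one common way to produce them and is not the authors' evaluation protocol.

```python
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

fid = FrechetInceptionDistance(feature=2048)                     # expects uint8 images, 0-255
clip_score = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")

def evaluate(real_images, generated_images, prompts):
    """real_images / generated_images: uint8 tensors of shape (N, 3, H, W);
    prompts: list of N strings. Illustration only, not the authors' protocol."""
    fid.update(real_images, real=True)
    fid.update(generated_images, real=False)
    clip = clip_score(generated_images, prompts)                 # text-image alignment
    return {"FID": fid.compute().item(), "CLIP score": clip.item()}
```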

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper is an empirical proposal of a VLM-query-to-diffusion pipeline for multimodal generation and editing. No mathematical derivation chain, equations, or first-principles predictions are presented. Performance claims rest on external benchmark comparisons rather than quantities defined or fitted from the method's own outputs. No self-definitional, fitted-input, or self-citation load-bearing reductions exist in the stated architecture or evaluation protocol.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The approach assumes standard pre-trained VLM and diffusion model capabilities plus the effectiveness of learnable query tokens; no new physical or mathematical axioms are introduced.

free parameters (1)
  • learnable query tokens
    These are introduced as trainable parameters whose values are fitted during training to produce useful embeddings from the VLM.
axioms (2)
  • domain assumption Pre-trained VLMs contain transferable semantic visual understanding that can be extracted via query tokens
    Invoked in the description of how embeddings are predicted and used as conditioning.
  • domain assumption Diffusion models can be effectively conditioned on VLM-derived embeddings without architectural overhaul
    Central to the claim of streamlined design and reduced overhead.

pith-pipeline@v0.9.0 · 5472 in / 1292 out tokens · 52623 ms · 2026-05-10T03:23:47.517387+00:00 · methodology

discussion (0)

