pith. machine review for the scientific record.

arxiv: 2605.12305 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 05:39 UTC · model grok-4.3

classification 💻 cs.CV
keywords interleaved instructions · visual token embedding · multimodal generation · transformer locality · image consistency · data synthesis · multimodal editing

The pith

Embedding images directly into text instructions at semantic positions improves consistency in multi-image generation

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents INSET, a model that embeds images as native tokens within textual instructions by placing their visual features at corresponding semantic slots. This design allows the transformer to use local contextual information to bind objects precisely to their descriptions, avoiding the long-range dependency problems in separated image-text models. A data engine generates 15 million high-quality interleaved samples from existing datasets using vision and language models. On benchmarks for interleaved instructions, the method shows stronger performance in maintaining consistency across multiple images and aligning with text, especially as the number of images and complexity grow. The method also enables direct multimodal editing by treating input images as part of the instruction sequence.

Core claim

INSET is a unified generation model that seamlessly embeds images as native vocabulary within textual instructions. By positioning visual features directly at their corresponding semantic slots, it leverages the contextual locality of transformers for precise object binding, effectively treating images as dense, expressive language tokens. A scalable data engine synthesizes 15M high-quality interleaved samples from standard image and video datasets.

What carries the argument

Positioning visual features directly at semantic slots within the textual token sequence to leverage transformer contextual locality for precise object binding.
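The mechanism is easiest to see as a sequence-construction step. The sketch below is a minimal illustration of the idea as described, not the authors' implementation: the placeholder convention, dimensions, and the visual_proj layer are illustrative assumptions; the point is only that each image's tokens land at the semantic slot its description occupies.

    # Minimal sketch (assumed names and dimensions, not the paper's code): build one
    # interleaved sequence in which each image's projected features sit at the exact
    # position its description occupies in the instruction.
    import torch
    import torch.nn as nn

    D_MODEL = 512          # shared transformer width (assumption)
    VIS_DIM = 768          # width of raw visual features from an image encoder (assumption)
    TOKENS_PER_IMAGE = 4   # visual tokens contributed by each embedded image (assumption)

    text_embed = nn.Embedding(1000, D_MODEL)   # stand-in text embedding table
    visual_proj = nn.Linear(VIS_DIM, D_MODEL)  # maps image features into the text token space

    def interleave(instruction_ids, image_features):
        """instruction_ids: text token ids, with -1 marking an image slot.
        image_features: one (TOKENS_PER_IMAGE, VIS_DIM) tensor per slot, in reading order.
        Returns a (seq_len, D_MODEL) embedding sequence with images at their slots."""
        parts, images = [], iter(image_features)
        for tok in instruction_ids:
            if tok == -1:                          # semantic slot: insert this image's tokens here
                parts.append(visual_proj(next(images)))
            else:                                  # ordinary text token
                parts.append(text_embed(torch.tensor([tok])))
        return torch.cat(parts, dim=0)

    # "place <imgA> on the shelf beside <imgB>": two image slots inside the text
    ids = [12, -1, 7, 9, 4, 31, -1]
    imgs = [torch.randn(TOKENS_PER_IMAGE, VIS_DIM) for _ in range(2)]
    print(interleave(ids, imgs).shape)             # torch.Size([13, 512]): 5 text + 2×4 visual tokens

Because each image's tokens are flanked by the clause that refers to it, the attention an object-binding decision needs is local rather than stretched across a long prompt followed by a separate image block; the same construction, with an input image occupying the slot, is how the paper frames the multimodal editing extension described above.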

If this is right

  • Multi-image consistency and text alignment outperform state-of-the-art methods on InterleaveBench.
  • Performance gaps widen as input complexity and number of images increase.
  • The model extends to multimodal image editing by integrating visual content directly into instructions.
  • A data engine can scalably produce 15M high-quality interleaved samples from standard datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The locality approach may support longer interleaved sequences without extra alignment modules.
  • Similar token placement could extend to other modalities such as audio segments or video frames.
  • Data synthesis techniques might generate training sets for related tasks like video generation.

Load-bearing premise

Positioning visual features directly at their corresponding semantic slots in the transformer leverages contextual locality for precise object binding without creating new alignment or training instabilities.

What would settle it

A test showing that multi-image consistency or text alignment fails to improve, or object binding errors increase, when visual features are placed at semantic slots in complex interleaved prompts compared to separated image-text processing.
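One concrete way to run such a test, sketched below under assumptions (InterleaveBench's actual scoring is not reproduced here; any image-embedding model, e.g. a CLIP-style encoder, stands in for the consistency scorer), is to score the same prompts under interleaved and separated conditioning and compare the curves as the number of image slots grows.

    # Mean pairwise cosine similarity across one generated image set; a stand-in
    # consistency score, not InterleaveBench's metric.
    import numpy as np

    def pairwise_consistency(embeddings):
        """embeddings: list of 1-D vectors, one per generated image in a set."""
        E = np.stack([e / np.linalg.norm(e) for e in embeddings])
        iu = np.triu_indices(len(E), k=1)          # upper triangle = distinct pairs
        return float((E @ E.T)[iu].mean())

    # Sanity check: identical embeddings score 1.0, orthogonal ones score 0.0.
    print(pairwise_consistency([np.ones(4), np.ones(4)]))        # 1.0
    print(pairwise_consistency([np.eye(4)[0], np.eye(4)[1]]))    # 0.0

Scoring the same prompt suite under both conditioning schemes and bucketing the results by number of image slots would probe the claim directly: if the interleaved scores do not separate from the separated-conditioning scores, or the gap shrinks as slots are added, the locality premise fails the test described above.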

Original abstract

While recent advancements in multimodal language models have enabled image generation from expressive multi-image instructions, existing methods struggle to maintain performance under complex interleaved instructions. This limitation stems from the structural separation of images and text in current paradigms, which forces models to bridge difficult long-range dependencies to match descriptions with visual targets. To address these challenges, we propose Images iN SEnTences (a.k.a. INSET), a unified generation model that seamlessly embeds images as native vocabulary within textual instructions. By positioning visual features directly at their corresponding semantic slots, INSET leverages the contextual locality of transformers for precise object binding, effectively treating images as dense, expressive language tokens. Furthermore, we introduce a scalable data engine that synthesizes 15M high-quality interleaved samples from standard image and video datasets, utilizing VLMs and LLMs to construct rich, long-horizon sequences. Evaluation results on InterleaveBench demonstrate that INSET significantly outperforms state-of-the-art methods in multi-image consistency and text alignment, with performance gaps widening as input complexity increases. Beyond standard generation, our approach inherently extends to multimodal image editing, integrating visual content as part of the instruction to facilitate highly expressive and creative visual manipulations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes INSET (Images in Sentences), a unified visual generation model that integrates images directly into textual instructions by embedding visual features at corresponding semantic positions within the transformer sequence. This approach aims to leverage contextual locality for improved multi-image consistency and text alignment. The authors also present a data synthesis pipeline that generates 15 million high-quality interleaved instruction samples using VLMs and LLMs from existing image and video datasets. They evaluate the model on InterleaveBench, claiming significant outperformance over state-of-the-art methods, particularly as instruction complexity increases, and note its extension to multimodal image editing.

Significance. If the empirical claims hold, this work could advance multimodal generation by offering a more integrated way to handle complex interleaved instructions, reducing reliance on long-range dependencies through direct visual token placement. The scalable synthetic data engine for creating 15M interleaved samples is a clear strength that could support future research in the area. The architectural choice to treat images as dense language tokens is conceptually interesting and could unify generation and editing tasks if the integration proves stable.
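For readers trying to picture the data engine, the following is a rough sketch of the kind of pipeline the summary describes; the function names (vlm_caption, llm_compose) and the <img> placeholder convention are hypothetical stand-ins, since the paper's prompts, models, and filtering steps are not specified here.

    # Hypothetical data-engine sketch: turn a short frame sequence into one
    # interleaved instruction -> target pair, using a VLM and an LLM as callables.
    from dataclasses import dataclass

    @dataclass
    class InterleavedSample:
        instruction: list        # alternating ("text", span) and ("image", ref) parts
        target_images: list      # image(s) the generator should produce

    def synthesize(frames, vlm_caption, llm_compose):
        """frames[:-1] become reference images embedded in the instruction;
        frames[-1] is held out as the generation target."""
        refs, target = frames[:-1], frames[-1]
        captions = [vlm_caption(f) for f in refs]       # 1. VLM describes each reference image
        text = llm_compose(captions)                    # 2. LLM weaves one instruction with one
        spans = text.split("<img>")                     #    <img> marker per reference image
        assert len(spans) == len(refs) + 1, "one marker per reference image expected"
        parts = []
        for span, ref in zip(spans, refs):              # 3. interleave text spans and image refs
            parts.extend([("text", span), ("image", ref)])
        parts.append(("text", spans[-1]))
        return InterleavedSample(instruction=parts, target_images=[target])

    # Toy usage with trivial stand-ins for the VLM and LLM:
    sample = synthesize(
        ["frame_a.png", "frame_b.png", "frame_c.png"],
        vlm_caption=lambda f: f"a scene from {f}",
        llm_compose=lambda caps: "Combine <img> with <img> into one final scene.",
    )
    print(sample.instruction)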

major comments (2)
  1. [§4 (Experiments/Evaluation)] The central claim that INSET significantly outperforms SOTA methods on InterleaveBench with widening gaps at higher complexity is presented only qualitatively in the abstract and evaluation section. No specific quantitative metrics (e.g., exact scores, tables of results), baseline details, ablation studies, or error analysis are provided, which is load-bearing for assessing whether the gains are due to the proposed insertion method versus data scale or training.
  2. [§3 (Method, visual feature insertion)] The core assumption that positioning visual features directly at semantic slots leverages contextual locality for precise object binding without creating alignment or training instabilities is not supported by any ablations on projection layers, dimension handling, attention entropy, or loss variance across modalities. This is critical because the largest claimed gains occur at high complexity, where any projection-induced misalignment would be expected to degrade multi-image consistency.
minor comments (2)
  1. [Abstract / §1] The acronym expansion for INSET is given in the abstract but could be restated more explicitly in the introduction for readers who skip the abstract.
  2. [§4 or §5] The extension to multimodal image editing is asserted but lacks any qualitative examples, failure cases, or quantitative comparison to dedicated editing baselines.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and will incorporate the suggested additions into the revised version to strengthen the empirical support for our claims.

Point-by-point responses
  1. Referee: [§4 (Experiments/Evaluation)] The central claim that INSET significantly outperforms SOTA methods on InterleaveBench with widening gaps at higher complexity is presented only qualitatively in the abstract and evaluation section. No specific quantitative metrics (e.g., exact scores, tables of results), baseline details, ablation studies, or error analysis are provided, which is load-bearing for assessing whether the gains are due to the proposed insertion method versus data scale or training.

    Authors: We agree that the current presentation relies too heavily on qualitative descriptions and figures. In the revised manuscript, we will add a dedicated results table in Section 4 reporting exact quantitative metrics (e.g., multi-image consistency and text alignment scores) for INSET versus all baselines on InterleaveBench, with explicit breakdowns by instruction complexity to demonstrate the widening gaps. We will also specify baseline details, include an ablation comparing our 15M interleaved data synthesis pipeline against training on standard datasets, and add an error analysis subsection. These changes will directly address whether the gains arise from the insertion method rather than from data scale or training. revision: yes

  2. Referee: [§3 (Method, visual feature insertion)] The core assumption that positioning visual features directly at semantic slots leverages contextual locality for precise object binding without creating alignment or training instabilities is not supported by any ablations on projection layers, dimension handling, attention entropy, or loss variance across modalities. This is critical because the largest claimed gains occur at high complexity, where any projection-induced misalignment would be expected to degrade multi-image consistency.

    Authors: The insertion strategy is intended to exploit the transformer's local attention patterns by placing visual tokens at semantically corresponding positions. To substantiate this, the revised Section 3 will include new ablations: comparisons of alternative projection layers and dimension alignment techniques, attention entropy visualizations demonstrating localized focus, and training loss variance metrics across text and image modalities. These will be reported for both low- and high-complexity instructions to confirm stability and rule out projection-induced degradation. revision: yes
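The two diagnostics promised in this response are standard enough to sketch generically; the formulations below are assumptions about how they would be computed, not the authors' definitions.

    # Attention entropy per head (lower = more localized attention) and per-modality
    # loss statistics, the two stability checks named in the response above.
    import torch

    def attention_entropy(attn):
        """attn: (heads, queries, keys) softmax attention weights.
        Returns the mean entropy per head, in nats."""
        ent = -(attn * attn.clamp_min(1e-12).log()).sum(dim=-1)   # entropy of each query's distribution
        return ent.mean(dim=-1)                                   # average over queries -> (heads,)

    def per_modality_loss_stats(token_losses, is_image_token):
        """token_losses: (seq_len,) per-token training loss.
        is_image_token: (seq_len,) bool mask marking inserted visual tokens.
        Returns {"text": (mean, var), "image": (mean, var)} to compare stability."""
        out = {}
        for name, mask in (("text", ~is_image_token), ("image", is_image_token)):
            vals = token_losses[mask]
            out[name] = (vals.mean().item(), vals.var().item())
        return out

    # A sharply peaked attention pattern has near-zero entropy; a uniform one has log(keys).
    attn = torch.softmax(torch.randn(8, 16, 16) * 5, dim=-1)
    print(attention_entropy(attn))

Reporting both for low- and high-complexity instructions, as the response proposes, is what would separate a genuine locality benefit from projection-induced instability.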

Circularity Check

0 steps flagged

No circularity: architectural proposal with external data synthesis and empirical evaluation

Full rationale

The paper describes an architectural design choice—embedding visual features directly at semantic slots in the transformer sequence to leverage contextual locality for object binding—without any mathematical derivations, equations, or parameter-fitting steps that could reduce to self-referential inputs. The data engine relies on external VLMs and LLMs for synthesizing 15M samples from standard datasets, and performance claims are grounded in evaluation on InterleaveBench rather than internal consistency or self-citations. No load-bearing steps invoke prior author work as uniqueness theorems or smuggle ansatzes; the approach is self-contained as a modeling innovation validated externally.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that transformer locality suffices for binding when images are placed at semantic slots, plus the quality of VLM/LLM-synthesized data; no explicit free parameters or new physical entities are introduced.

axioms (1)
  • domain assumption: Transformers leverage contextual locality for precise object binding when visual features are positioned at corresponding semantic slots.
    Invoked to justify why the embedding approach solves long-range dependency problems.
invented entities (1)
  • INSET architecture: no independent evidence
    purpose: Unified model treating images as dense language tokens for interleaved generation and editing.
    New model proposed in this work; no independent evidence provided beyond the claimed benchmark results.

pith-pipeline@v0.9.0 · 5528 in / 1242 out tokens · 105014 ms · 2026-05-13T05:39:38.661397+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 20 internal anchors

  1. [1]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report.ar...

  2. [2]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021

  3. [3]

    BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

    Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568, 2025

  4. [4]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025

  5. [5]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

  6. [6]

    Gemini 2.5 flash image

    Google DeepMind. Gemini 2.5 flash image. https://developers.googleblog.com/en/introducing-gemini-2-5-flash-image/, 2025. Accessed: 2025-10-30

  7. [7]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

  8. [8]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024

  9. [9]

    Seed1.5-VL Technical Report

    Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1.5-VL technical report. arXiv preprint arXiv:2505.07062, 2025

  10. [10]

    Chameleon: Hierarchical clustering using dynamic modeling

    George Karypis, Eui-Hong Han, and Vipin Kumar. Chameleon: Hierarchical clustering using dynamic modeling. computer, 32(8):68–75, 1999

  11. [11]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InICCV, 2023

  12. [12]

    Flux

    Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024

  13. [13]

    FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

    Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Müller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith. Flux.1 kontext: Flow matching for in-context image ...

  14. [14]

    Omnicorpus: An unified multimodal corpus of 10 billion-level images interleaved with text

    Qingyun Li, Zhe Chen, Weiyun Wang, Wenhai Wang, Shenglong Ye, Zhenjiang Jin, Guanzhou Chen, Yinan He, Zhangwei Gao, Erfei Cui, et al. Omnicorpus: An unified multimodal corpus of 10 billion-level images interleaved with text. arXiv preprint arXiv:2406.08418, 2024

  15. [15]

    Describe anything: Detailed localized image and video captioning. arXiv preprint arXiv:2504.16072, 2025

    Long Lian, Yifan Ding, Yunhao Ge, Sifei Liu, Hanzi Mao, Boyi Li, Marco Pavone, Ming-Yu Liu, Trevor Darrell, Adam Yala, et al. Describe anything: Detailed localized image and video captioning.arXiv preprint arXiv:2504.16072, 2025

  16. [16]

    Mogao: An omni foundation model for interleaved multi-modal generation.arXiv preprint arXiv:2505.05472, 2025

    Chao Liao, Liyang Liu, Xun Wang, Zhengxiong Luo, Xinyu Zhang, Wenliang Zhao, Jie Wu, Liang Li, Zhi Tian, and Weilin Huang. Mogao: An omni foundation model for interleaved multi-modal generation.arXiv preprint arXiv:2505.05472, 2025

  17. [17]

    UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

    Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, et al. Uniworld: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147, 2025

  18. [18]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.arXiv:2304.08485, 2023

  19. [19]

    Lost in the middle: How language models use long contexts.Transactions of the Association for Computational Linguistics, 12:157–173, 2024

    Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.Transactions of the Association for Computational Linguistics, 12:157–173, 2024

  20. [20]

    Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation

    Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai Yu, et al. Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 7739–7751, 2025

  21. [21]

    Dreamo: A unified framework for image customization.arXiv preprint arXiv:2504.16915,

    Chong Mou, Yanze Wu, Wenxu Wu, Zinan Guo, Pengze Zhang, Yufeng Cheng, Yiming Luo, Fei Ding, Shiwen Zhang, Xinghui Li, et al. Dreamo: A unified framework for image customization.arXiv preprint arXiv:2504.16915, 2025

  22. [22]

    GPT-4 Technical Report

    OpenAI. Gpt-4 technical report.arXiv:2303.08774, 2023

  23. [23]

    Gpt-4v(ision) system card, 2023

    OpenAI. Gpt-4v(ision) system card, 2023. URLhttps://openai.com/research/gpt-4v-system-card

  24. [24]

    Introducing 4o image generation.https://openai.com/index/introducing-4o-image-generation/,

    OpenAI. Introducing 4o image generation.https://openai.com/index/introducing-4o-image-generation/,

  25. [25]

    Accessed: 2025-12-19

  26. [26]

    Kosmos-g: Generating images in context with multimodal large language models.arXiv preprint arXiv:2310.02992, 2023

    Xichen Pan, Li Dong, Shaohan Huang, Zhiliang Peng, Wenhu Chen, and Furu Wei. Kosmos-g: Generating images in context with multimodal large language models.arXiv preprint arXiv:2310.02992, 2023

  27. [27]

    Dreambench++: A human-aligned benchmark for personalized image generation.arXiv preprint arXiv:2406.16855, 2024

    Yuang Peng, Yuxin Cui, Haomiao Tang, Zekun Qi, Runpei Dong, Jing Bai, Chunrui Han, Zheng Ge, Xiangyu Zhang, and Shu-Tao Xia. Dreambench++: A human-aligned benchmark for personalized image generation.arXiv preprint arXiv:2406.16855, 2024

  28. [28]

    Kosmos-2: Grounding Multimodal Large Language Models to the World

    Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv:2306.14824, 2023

  29. [29]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023

  30. [30]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PmLR, 2021

  31. [31]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  32. [32]

    Seedream 4.0: Toward Next-generation Multimodal Image Generation

    Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next-generation multimodal image generation.arXiv preprint arXiv:2509.20427, 2025

  33. [33]

    Generative multimodal models are in-context learners

    Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative multimodal models are in-context learners. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14398–14409, 2024

  34. [34]

    Simplear: Pushing the frontier of autoregressive visual generation through pretraining, SFT, and RL

    Junke Wang, Zhi Tian, Xun Wang, Xinyu Zhang, Weilin Huang, Zuxuan Wu, and Yu-Gang Jiang. Simplear: Pushing the frontier of autoregressive visual generation through pretraining, SFT, and RL. arXiv preprint arXiv:2504.11455, 2025

  35. [35]

    Skywork unipic: Unified autoregressive modeling for visual understanding and generation

    Peiyu Wang, Yi Peng, Yimeng Gan, Liang Hu, Tianyidan Xie, Xiaokun Wang, Yichen Wei, Chuanxin Tang, Bo Zhu, Changshi Li, et al. Skywork unipic: Unified autoregressive modeling for visual understanding and generation. arXiv preprint arXiv:2508.03320, 2025

  36. [36]

    Instantid: Zero-shot identity-preserving generation in seconds.arXiv preprint arXiv:2401.07519, 2024

    Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, Anthony Chen, Huaxia Li, Xu Tang, and Yao Hu. Instantid: Zero-shot identity-preserving generation in seconds.arXiv preprint arXiv:2401.07519, 2024

  37. [37]

    Emu3: Next-Token Prediction is All You Need

    Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need.arXiv preprint arXiv:2409.18869, 2024

  38. [38]

    Skywork unipic 2.0: Building kontext model with online rl for unified multimodal model

    Hongyang Wei, Baixin Xu, Hongbo Liu, Cyrus Wu, Jie Liu, Yi Peng, Peiyu Wang, Zexiang Liu, Jingwen He, Yidan Xietian, et al. Skywork unipic 2.0: Building kontext model with online rl for unified multimodal model. arXiv preprint arXiv:2509.04548, 2025

  39. [39]

    Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation

    Yuxiang Wei, Yabo Zhang, Zhilong Ji, Jinfeng Bai, Lei Zhang, and Wangmeng Zuo. Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15943–15953, 2023

  40. [40]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun...

  41. [41]

    Janus: Decoupling visual encoding for unified multimodal understanding and generation

    Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12966–12977, 2025

  42. [42]

    OmniGen2: Towards Instruction-Aligned Multimodal Generation

    Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, Ze Liu, Ziyi Xia, Chaofan Li, Haoge Deng, Jiahao Wang, Kun Luo, Bo Zhang, Defu Lian, Xinlong Wang, Zhongyuan Wang, Tiejun Huang, and Zheng Liu. Omnigen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871, 2025

  43. [43]

    Harmonizing visual representations for unified multimodal understanding and generation.arXiv preprint arXiv:2503.21979, 2025

    Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Zhonghua Wu, Qingyi Tao, Wentao Liu, Wei Li, and Chen Change Loy. Harmonizing visual representations for unified multimodal understanding and generation. arXiv preprint arXiv:2503.21979, 2025

  44. [44]

    Dreamomni2: Multimodal instruction-based editing and generation

    Bin Xia, Bohao Peng, Yuechen Zhang, Junjia Huang, Jiyang Liu, Jingyao Li, Haoru Tan, Sitong Wu, Chengyao Wang, Yitong Wang, et al. Dreamomni2: Multimodal instruction-based editing and generation.arXiv preprint arXiv:2510.06679, 2025

  45. [45]

    Omnigen: Unified image generation

    Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. Omnigen: Unified image generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13294–13304, 2025

  46. [46]

    Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528, 2024

  47. [47]

    Show-o2: Improved Native Unified Multimodal Models

    Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models.arXiv preprint arXiv:2506.15564, 2025

  48. [48]

    IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

    Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models.arXiv preprint arXiv:2308.06721, 2023

  49. [49]

    Echo-4o: Harnessing the power of gpt-4o synthetic images for improved image generation.arXiv preprint arXiv:2508.09987, 2025

    Junyan Ye, Dongzhi Jiang, Zihao Wang, Leqi Zhu, Zhenghao Hu, Zilong Huang, Jun He, Zhiyuan Yan, Jinghua Yu, Hongsheng Li, et al. Echo-4o: Harnessing the power of gpt-4o synthetic images for improved image generation. arXiv preprint arXiv:2508.09987, 2025

  50. [50]

    Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

    Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model. arXiv preprint arXiv:2408.11039, 2024

  51. [51]

    Multimodal C4: An open, billion-scale corpus of images interleaved with text

    Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, and Yejin Choi. Multimodal C4: An open, billion-scale corpus of images interleaved with text. Advances in Neural Information Processing Systems, 36:8958–8974, 2023