pith. machine review for the scientific record.

arxiv: 2605.12305 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 05:39 UTC · model grok-4.3

classification 💻 cs.CV
keywords interleaved instructions · visual token embedding · multimodal generation · transformer locality · image consistency · data synthesis · multimodal editing

The pith

Embedding images directly into text instructions at semantic positions improves consistency in multi-image generation

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents INSET, a model that embeds images as native tokens within textual instructions by placing their visual features at corresponding semantic slots. This design allows the transformer to use local contextual information to bind objects precisely to their descriptions, avoiding the long-range dependency problems in separated image-text models. A data engine generates 15 million high-quality interleaved samples from existing datasets using vision and language models. On benchmarks for interleaved instructions, the method shows stronger performance in maintaining consistency across multiple images and aligning with text, especially as the number of images and complexity grow. The method also enables direct multimodal editing by treating input images as part of the instruction sequence.

Core claim

INSET is a unified generation model that seamlessly embeds images as native vocabulary within textual instructions. By positioning visual features directly at their corresponding semantic slots, it leverages the contextual locality of transformers for precise object binding, effectively treating images as dense, expressive language tokens. A scalable data engine synthesizes 15M high-quality interleaved samples from standard image and video datasets.

What carries the argument

Positioning visual features directly at semantic slots within the textual token sequence to leverage transformer contextual locality for precise object binding.
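The mechanism is easiest to see as a sequence-construction step. The sketch below is a minimal illustration of the idea as described, not the authors' implementation: the placeholder convention, dimensions, and the visual_proj layer are illustrative assumptions; the point is only that each image's tokens land at the semantic slot its description occupies.

    # Minimal sketch (assumed names and dimensions, not the paper's code): build one
    # interleaved sequence in which each image's projected features sit at the exact
    # position its description occupies in the instruction.
    import torch
    import torch.nn as nn

    D_MODEL = 512          # shared transformer width (assumption)
    VIS_DIM = 768          # width of raw visual features from an image encoder (assumption)
    TOKENS_PER_IMAGE = 4   # visual tokens contributed by each embedded image (assumption)

    text_embed = nn.Embedding(1000, D_MODEL)   # stand-in text embedding table
    visual_proj = nn.Linear(VIS_DIM, D_MODEL)  # maps image features into the text token space

    def interleave(instruction_ids, image_features):
        """instruction_ids: text token ids, with -1 marking an image slot.
        image_features: one (TOKENS_PER_IMAGE, VIS_DIM) tensor per slot, in reading order.
        Returns a (seq_len, D_MODEL) embedding sequence with images at their slots."""
        parts, images = [], iter(image_features)
        for tok in instruction_ids:
            if tok == -1:                          # semantic slot: insert this image's tokens here
                parts.append(visual_proj(next(images)))
            else:                                  # ordinary text token
                parts.append(text_embed(torch.tensor([tok])))
        return torch.cat(parts, dim=0)

    # "place <imgA> on the shelf beside <imgB>": two image slots inside the text
    ids = [12, -1, 7, 9, 4, 31, -1]
    imgs = [torch.randn(TOKENS_PER_IMAGE, VIS_DIM) for _ in range(2)]
    print(interleave(ids, imgs).shape)             # torch.Size([13, 512]): 5 text + 2×4 visual tokens

Because each image's tokens are flanked by the clause that refers to it, the attention an object-binding decision needs is local rather than stretched across a long prompt followed by a separate image block; the same construction, with an input image occupying the slot, is how the paper frames the multimodal editing extension described above.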

If this is right

  • Multi-image consistency and text alignment outperform state-of-the-art methods on InterleaveBench.
  • Performance gaps widen as input complexity and number of images increase.
  • The model extends to multimodal image editing by integrating visual content directly into instructions.
  • A data engine can scalably produce 15M high-quality interleaved samples from standard datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The locality approach may support longer interleaved sequences without extra alignment modules.
  • Similar token placement could extend to other modalities such as audio segments or video frames.
  • Data synthesis techniques might generate training sets for related tasks like video generation.

Load-bearing premise

Positioning visual features directly at their corresponding semantic slots in the transformer leverages contextual locality for precise object binding without creating new alignment or training instabilities.

What would settle it

A test showing that multi-image consistency or text alignment fails to improve, or object binding errors increase, when visual features are placed at semantic slots in complex interleaved prompts compared to separated image-text processing.
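One concrete way to run such a test, sketched below under assumptions (InterleaveBench's actual scoring is not reproduced here; any image-embedding model, e.g. a CLIP-style encoder, stands in for the consistency scorer), is to score the same prompts under interleaved and separated conditioning and compare the curves as the number of image slots grows.

    # Mean pairwise cosine similarity across one generated image set; a stand-in
    # consistency score, not InterleaveBench's metric.
    import numpy as np

    def pairwise_consistency(embeddings):
        """embeddings: list of 1-D vectors, one per generated image in a set."""
        E = np.stack([e / np.linalg.norm(e) for e in embeddings])
        iu = np.triu_indices(len(E), k=1)          # upper triangle = distinct pairs
        return float((E @ E.T)[iu].mean())

    # Sanity check: identical embeddings score 1.0, orthogonal ones score 0.0.
    print(pairwise_consistency([np.ones(4), np.ones(4)]))        # 1.0
    print(pairwise_consistency([np.eye(4)[0], np.eye(4)[1]]))    # 0.0

Scoring the same prompt suite under both conditioning schemes and bucketing the results by number of image slots would probe the claim directly: if the interleaved scores do not separate from the separated-conditioning scores, or the gap shrinks as slots are added, the locality premise fails the test described above.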

Original abstract

While recent advancements in multimodal language models have enabled image generation from expressive multi-image instructions, existing methods struggle to maintain performance under complex interleaved instructions. This limitation stems from the structural separation of images and text in current paradigms, which forces models to bridge difficult long-range dependencies to match descriptions with visual targets. To address these challenges, we propose Images iN SEnTences (a.k.a. INSET), a unified generation model that seamlessly embeds images as native vocabulary within textual instructions. By positioning visual features directly at their corresponding semantic slots, INSET leverages the contextual locality of transformers for precise object binding, effectively treating images as dense, expressive language tokens. Furthermore, we introduce a scalable data engine that synthesizes 15M high-quality interleaved samples from standard image and video datasets, utilizing VLMs and LLMs to construct rich, long-horizon sequences. Evaluation results on InterleaveBench demonstrate that INSET significantly outperforms state-of-the-art methods in multi-image consistency and text alignment, with performance gaps widening as input complexity increases. Beyond standard generation, our approach inherently extends to multimodal image editing, integrating visual content as part of the instruction to facilitate highly expressive and creative visual manipulations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes INSET (Images in Sentences), a unified visual generation model that integrates images directly into textual instructions by embedding visual features at corresponding semantic positions within the transformer sequence. This approach aims to leverage contextual locality for improved multi-image consistency and text alignment. The authors also present a data synthesis pipeline that generates 15 million high-quality interleaved instruction samples using VLMs and LLMs from existing image and video datasets. They evaluate the model on InterleaveBench, claiming significant outperformance over state-of-the-art methods, particularly as instruction complexity increases, and note its extension to multimodal image editing.

Significance. If the empirical claims hold, this work could advance multimodal generation by offering a more integrated way to handle complex interleaved instructions, reducing reliance on long-range dependencies through direct visual token placement. The scalable synthetic data engine for creating 15M interleaved samples is a clear strength that could support future research in the area. The architectural choice to treat images as dense language tokens is conceptually interesting and could unify generation and editing tasks if the integration proves stable.
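For readers trying to picture the data engine, the following is a rough sketch of the kind of pipeline the summary describes; the function names (vlm_caption, llm_compose) and the <img> placeholder convention are hypothetical stand-ins, since the paper's prompts, models, and filtering steps are not specified here.

    # Hypothetical data-engine sketch: turn a short frame sequence into one
    # interleaved instruction -> target pair, using a VLM and an LLM as callables.
    from dataclasses import dataclass

    @dataclass
    class InterleavedSample:
        instruction: list        # alternating ("text", span) and ("image", ref) parts
        target_images: list      # image(s) the generator should produce

    def synthesize(frames, vlm_caption, llm_compose):
        """frames[:-1] become reference images embedded in the instruction;
        frames[-1] is held out as the generation target."""
        refs, target = frames[:-1], frames[-1]
        captions = [vlm_caption(f) for f in refs]       # 1. VLM describes each reference image
        text = llm_compose(captions)                    # 2. LLM weaves one instruction with one
        spans = text.split("<img>")                     #    <img> marker per reference image
        assert len(spans) == len(refs) + 1, "one marker per reference image expected"
        parts = []
        for span, ref in zip(spans, refs):              # 3. interleave text spans and image refs
            parts.extend([("text", span), ("image", ref)])
        parts.append(("text", spans[-1]))
        return InterleavedSample(instruction=parts, target_images=[target])

    # Toy usage with trivial stand-ins for the VLM and LLM:
    sample = synthesize(
        ["frame_a.png", "frame_b.png", "frame_c.png"],
        vlm_caption=lambda f: f"a scene from {f}",
        llm_compose=lambda caps: "Combine <img> with <img> into one final scene.",
    )
    print(sample.instruction)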

major comments (2)
  1. [§4 (Experiments/Evaluation)] The central claim that INSET significantly outperforms SOTA methods on InterleaveBench with widening gaps at higher complexity is presented only qualitatively in the abstract and evaluation section. No specific quantitative metrics (e.g., exact scores, tables of results), baseline details, ablation studies, or error analysis are provided, which is load-bearing for assessing whether the gains are due to the proposed insertion method versus data scale or training.
  2. [§3 (Method, visual feature insertion)] The core assumption that positioning visual features directly at semantic slots leverages contextual locality for precise object binding without creating alignment or training instabilities is not supported by any ablations on projection layers, dimension handling, attention entropy, or loss variance across modalities. This is critical because the largest claimed gains occur at high complexity, where any projection-induced misalignment would be expected to degrade multi-image consistency.
minor comments (2)
  1. [Abstract / §1] The acronym expansion for INSET is given in the abstract but could be restated more explicitly in the introduction for readers who skip the abstract.
  2. [§4 or §5] The extension to multimodal image editing is asserted but lacks any qualitative examples, failure cases, or quantitative comparison to dedicated editing baselines.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and will incorporate the suggested additions into the revised version to strengthen the empirical support for our claims.

Point-by-point responses
  1. Referee: [§4 (Experiments/Evaluation)] The central claim that INSET significantly outperforms SOTA methods on InterleaveBench with widening gaps at higher complexity is presented only qualitatively in the abstract and evaluation section. No specific quantitative metrics (e.g., exact scores, tables of results), baseline details, ablation studies, or error analysis are provided, which is load-bearing for assessing whether the gains are due to the proposed insertion method versus data scale or training.

    Authors: We agree that the current presentation relies too heavily on qualitative descriptions and figures. In the revised manuscript, we will add a dedicated results table in Section 4 reporting exact quantitative metrics (e.g., multi-image consistency and text alignment scores) for INSET versus all baselines on InterleaveBench, with explicit breakdowns by instruction complexity to demonstrate the widening gaps. We will also specify baseline details, include an ablation comparing our 15M interleaved data synthesis pipeline against training on standard datasets, and add an error analysis subsection. These changes will directly address whether the gains arise from the insertion method rather than from data scale or training. revision: yes

  2. Referee: [§3 (Method, visual feature insertion)] The core assumption that positioning visual features directly at semantic slots leverages contextual locality for precise object binding without creating alignment or training instabilities is not supported by any ablations on projection layers, dimension handling, attention entropy, or loss variance across modalities. This is critical because the largest claimed gains occur at high complexity, where any projection-induced misalignment would be expected to degrade multi-image consistency.

    Authors: The insertion strategy is intended to exploit the transformer's local attention patterns by placing visual tokens at semantically corresponding positions. To substantiate this, the revised Section 3 will include new ablations: comparisons of alternative projection layers and dimension alignment techniques, attention entropy visualizations demonstrating localized focus, and training loss variance metrics across text and image modalities. These will be reported for both low- and high-complexity instructions to confirm stability and rule out projection-induced degradation. revision: yes
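The two diagnostics promised in this response are standard enough to sketch generically; the formulations below are assumptions about how they would be computed, not the authors' definitions.

    # Attention entropy per head (lower = more localized attention) and per-modality
    # loss statistics, the two stability checks named in the response above.
    import torch

    def attention_entropy(attn):
        """attn: (heads, queries, keys) softmax attention weights.
        Returns the mean entropy per head, in nats."""
        ent = -(attn * attn.clamp_min(1e-12).log()).sum(dim=-1)   # entropy of each query's distribution
        return ent.mean(dim=-1)                                   # average over queries -> (heads,)

    def per_modality_loss_stats(token_losses, is_image_token):
        """token_losses: (seq_len,) per-token training loss.
        is_image_token: (seq_len,) bool mask marking inserted visual tokens.
        Returns {"text": (mean, var), "image": (mean, var)} to compare stability."""
        out = {}
        for name, mask in (("text", ~is_image_token), ("image", is_image_token)):
            vals = token_losses[mask]
            out[name] = (vals.mean().item(), vals.var().item())
        return out

    # A sharply peaked attention pattern has near-zero entropy; a uniform one has log(keys).
    attn = torch.softmax(torch.randn(8, 16, 16) * 5, dim=-1)
    print(attention_entropy(attn))

Reporting both for low- and high-complexity instructions, as the response proposes, is what would separate a genuine locality benefit from projection-induced instability.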

Circularity Check

0 steps flagged

No circularity: architectural proposal with external data synthesis and empirical evaluation

Full rationale

The paper describes an architectural design choice—embedding visual features directly at semantic slots in the transformer sequence to leverage contextual locality for object binding—without any mathematical derivations, equations, or parameter-fitting steps that could reduce to self-referential inputs. The data engine relies on external VLMs and LLMs for synthesizing 15M samples from standard datasets, and performance claims are grounded in evaluation on InterleaveBench rather than internal consistency or self-citations. No load-bearing steps invoke prior author work as uniqueness theorems or smuggle ansatzes; the approach is self-contained as a modeling innovation validated externally.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that transformer locality suffices for binding when images are placed at semantic slots, plus the quality of VLM/LLM-synthesized data; no explicit free parameters or new physical entities are introduced.

axioms (1)
  • domain assumption: Transformers leverage contextual locality for precise object binding when visual features are positioned at corresponding semantic slots.
    Invoked to justify why the embedding approach solves long-range dependency problems.
invented entities (1)
  • INSET architecture: no independent evidence
    purpose: Unified model treating images as dense language tokens for interleaved generation and editing.
    New model proposed in this work; no independent evidence provided beyond the claimed benchmark results.

pith-pipeline@v0.9.0 · 5528 in / 1242 out tokens · 105014 ms · 2026-05-13T05:39:38.661397+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 20 internal anchors

  1. [1]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report.ar...

  2. [2]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021

  3. [3]

    BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

    Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568, 2025

  4. [4]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025

  5. [5]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

  6. [6]

    Gemini 2.5 flash image

    Google DeepMind. Gemini 2.5 flash image. https://developers.googleblog.com/en/introducing-gemini-2-5-flash-image/, 2025. Accessed: 2025-10-30

  7. [7]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

  8. [8]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024

  9. [9]

    Seed1.5-VL Technical Report

    Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1.5-VL technical report. arXiv preprint arXiv:2505.07062, 2025

  10. [10]

    Chameleon: Hierarchical clustering using dynamic modeling

    George Karypis, Eui-Hong Han, and Vipin Kumar. Chameleon: Hierarchical clustering using dynamic modeling. computer, 32(8):68–75, 1999

  11. [11]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InICCV, 2023

  12. [12]

    Flux

    Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024

  13. [13]

    FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

    Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Müller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith. Flux.1 kontext: Flow matching for in-context image ...

  14. [14]

    Omnicorpus: An unified multimodal corpus of 10 billion-level images interleaved with text

    Qingyun Li, Zhe Chen, Weiyun Wang, Wenhai Wang, Shenglong Ye, Zhenjiang Jin, Guanzhou Chen, Yinan He, Zhangwei Gao, Erfei Cui, et al. Omnicorpus: An unified multimodal corpus of 10 billion-level images interleaved with text. arXiv preprint arXiv:2406.08418, 2024

  15. [15]

    Describe anything: Detailed localized image and video captioning. arXiv preprint arXiv:2504.16072, 2025

    Long Lian, Yifan Ding, Yunhao Ge, Sifei Liu, Hanzi Mao, Boyi Li, Marco Pavone, Ming-Yu Liu, Trevor Darrell, Adam Yala, et al. Describe anything: Detailed localized image and video captioning.arXiv preprint arXiv:2504.16072, 2025

  16. [16]

    Mogao: An omni foundation model for interleaved multi-modal generation.arXiv preprint arXiv:2505.05472, 2025

    Chao Liao, Liyang Liu, Xun Wang, Zhengxiong Luo, Xinyu Zhang, Wenliang Zhao, Jie Wu, Liang Li, Zhi Tian, and Weilin Huang. Mogao: An omni foundation model for interleaved multi-modal generation.arXiv preprint arXiv:2505.05472, 2025

  17. [17]

    UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

    Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, et al. Uniworld: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147, 2025

  18. [18]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.arXiv:2304.08485, 2023

  19. [19]

    Lost in the middle: How language models use long contexts.Transactions of the Association for Computational Linguistics, 12:157–173, 2024

    Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.Transactions of the Association for Computational Linguistics, 12:157–173, 2024

  20. [20]

    Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation

    Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai Yu, et al. Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 7739–7751, 2025

  21. [21]

    Dreamo: A unified framework for image customization.arXiv preprint arXiv:2504.16915,

    Chong Mou, Yanze Wu, Wenxu Wu, Zinan Guo, Pengze Zhang, Yufeng Cheng, Yiming Luo, Fei Ding, Shiwen Zhang, Xinghui Li, et al. Dreamo: A unified framework for image customization.arXiv preprint arXiv:2504.16915, 2025

  22. [22]

    GPT-4 Technical Report

    OpenAI. Gpt-4 technical report.arXiv:2303.08774, 2023

  23. [23]

    Gpt-4v(ision) system card, 2023

    OpenAI. Gpt-4v(ision) system card, 2023. URLhttps://openai.com/research/gpt-4v-system-card

  24. [24]

    Introducing 4o image generation.https://openai.com/index/introducing-4o-image-generation/,

    OpenAI. Introducing 4o image generation.https://openai.com/index/introducing-4o-image-generation/,

  25. [25]

    Accessed: 2025-12-19

  26. [26]

    Kosmos-g: Generating images in context with multimodal large language models.arXiv preprint arXiv:2310.02992, 2023

    Xichen Pan, Li Dong, Shaohan Huang, Zhiliang Peng, Wenhu Chen, and Furu Wei. Kosmos-g: Generating images in context with multimodal large language models.arXiv preprint arXiv:2310.02992, 2023

  27. [27]

    Dreambench++: A human-aligned benchmark for personalized image generation.arXiv preprint arXiv:2406.16855, 2024

    Yuang Peng, Yuxin Cui, Haomiao Tang, Zekun Qi, Runpei Dong, Jing Bai, Chunrui Han, Zheng Ge, Xiangyu Zhang, and Shu-Tao Xia. Dreambench++: A human-aligned benchmark for personalized image generation.arXiv preprint arXiv:2406.16855, 2024

  28. [28]

    Kosmos-2: Grounding Multimodal Large Language Models to the World

    Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv:2306.14824, 2023

  29. [29]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023

  30. [30]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PmLR, 2021

  31. [31]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  32. [32]

    Seedream 4.0: Toward Next-generation Multimodal Image Generation

    Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next-generation multimodal image generation.arXiv preprint arXiv:2509.20427, 2025

  33. [33]

    Generative multimodal models are in-context learners

    Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative multimodal models are in-context learners. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14398–14409, 2024

  34. [34]

    Simplear: Pushing the frontier of autoregressive visual generation through pretraining, SFT, and RL

    Junke Wang, Zhi Tian, Xun Wang, Xinyu Zhang, Weilin Huang, Zuxuan Wu, and Yu-Gang Jiang. Simplear: Pushing the frontier of autoregressive visual generation through pretraining, SFT, and RL. arXiv preprint arXiv:2504.11455, 2025

  35. [35]

    Skywork unipic: Unified autoregressive modeling for visual understanding and generation

    Peiyu Wang, Yi Peng, Yimeng Gan, Liang Hu, Tianyidan Xie, Xiaokun Wang, Yichen Wei, Chuanxin Tang, Bo Zhu, Changshi Li, et al. Skywork unipic: Unified autoregressive modeling for visual understanding and generation. arXiv preprint arXiv:2508.03320, 2025

  36. [36]

    Instantid: Zero-shot identity-preserving generation in seconds.arXiv preprint arXiv:2401.07519, 2024

    Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, Anthony Chen, Huaxia Li, Xu Tang, and Yao Hu. Instantid: Zero-shot identity-preserving generation in seconds.arXiv preprint arXiv:2401.07519, 2024

  37. [37]

    Emu3: Next-Token Prediction is All You Need

    Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need.arXiv preprint arXiv:2409.18869, 2024

  38. [38]

    Skywork unipic 2.0: Building kontext model with online rl for unified multimodal model

    Hongyang Wei, Baixin Xu, Hongbo Liu, Cyrus Wu, Jie Liu, Yi Peng, Peiyu Wang, Zexiang Liu, Jingwen He, Yidan Xietian, et al. Skywork unipic 2.0: Building kontext model with online rl for unified multimodal model. arXiv preprint arXiv:2509.04548, 2025

  39. [39]

    Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation

    Yuxiang Wei, Yabo Zhang, Zhilong Ji, Jinfeng Bai, Lei Zhang, and Wangmeng Zuo. Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15943–15953, 2023

  40. [40]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun...

  41. [41]

    Janus: Decoupling visual encoding for unified multimodal understanding and generation

    Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12966–12977, 2025

  42. [42]

    OmniGen2: Towards Instruction-Aligned Multimodal Generation

    Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, Ze Liu, Ziyi Xia, Chaofan Li, Haoge Deng, Jiahao Wang, Kun Luo, Bo Zhang, Defu Lian, Xinlong Wang, Zhongyuan Wang, Tiejun Huang, and Zheng Liu. Omnigen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871, 2025

  43. [43]

    Harmonizing visual representations for unified multimodal understanding and generation.arXiv preprint arXiv:2503.21979, 2025

    Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Zhonghua Wu, Qingyi Tao, Wentao Liu, Wei Li, and Chen Change Loy. Harmonizing visual representations for unified multimodal understanding and generation. arXiv preprint arXiv:2503.21979, 2025

  44. [44]

    Dreamomni2: Multimodal instruction-based editing and generation

    Bin Xia, Bohao Peng, Yuechen Zhang, Junjia Huang, Jiyang Liu, Jingyao Li, Haoru Tan, Sitong Wu, Chengyao Wang, Yitong Wang, et al. Dreamomni2: Multimodal instruction-based editing and generation.arXiv preprint arXiv:2510.06679, 2025

  45. [45]

    Omnigen: Unified image generation

    Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. Omnigen: Unified image generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13294–13304, 2025

  46. [46]

    Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528, 2024

  47. [47]

    Show-o2: Improved Native Unified Multimodal Models

    Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models.arXiv preprint arXiv:2506.15564, 2025

  48. [48]

    IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

    Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models.arXiv preprint arXiv:2308.06721, 2023

  49. [49]

    Echo-4o: Harnessing the power of gpt-4o synthetic images for improved image generation.arXiv preprint arXiv:2508.09987, 2025

    Junyan Ye, Dongzhi Jiang, Zihao Wang, Leqi Zhu, Zhenghao Hu, Zilong Huang, Jun He, Zhiyuan Yan, Jinghua Yu, Hongsheng Li, et al. Echo-4o: Harnessing the power of gpt-4o synthetic images for improved image generation. arXiv preprint arXiv:2508.09987, 2025

  50. [50]

    Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

    Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model. arXiv preprint arXiv:2408.11039, 2024

  51. [51]

    Multimodal C4: An open, billion-scale corpus of images interleaved with text

    Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, and Yejin Choi. Multimodal C4: An open, billion-scale corpus of images interleaved with text. Advances in Neural Information Processing Systems, 36:8958–8974, 2023