Imagine Before You Draw: Visual Prompt Engineering for Image Generation

Aojun Zhou; Fengda Zhang; Hanwang Zhang; Jiachun Pan; Kesen Zhao; Liyu Jia; Saining Zhang; Wang Lin; Weijia Wu; Yue Liao

arxiv: 2606.04457 · v1 · pith:DKPDOY7Onew · submitted 2026-06-03 · 💻 cs.CV

Imagine Before You Draw: Visual Prompt Engineering for Image Generation

Liyu Jia , Fengda Zhang , Jiachun Pan , Kesen Zhao , Saining Zhang , Wang Lin , Weijia Wu , Yue Liao

show 2 more authors

Aojun Zhou Hanwang Zhang

This is my paper

Pith reviewed 2026-06-28 07:10 UTC · model grok-4.3

classification 💻 cs.CV

keywords visual prompt engineeringsemantic tokensimage generationtext-to-imageimage editingautoregressive modelsdiffusion models

0 comments

The pith

Integrating visual semantic token generation inside image models avoids external pipeline bottlenecks and improves editing preservation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Visual Prompt Engineering to let a single model first autoregressively produce visual semantic tokens that outline semantic layout, then generate image tokens conditioned on that internal plan. This design targets the information bottleneck created when separate models handle semantic planning and image decoding in sequence. By keeping both the original input and the semantic plan accessible to the decoder at once, the approach seeks to ease the text-to-image modeling gap while preserving finer details. Experiments across generation and editing tasks show faster convergence and higher ceilings, with the largest gains appearing in editing where external methods lose information.

Core claim

VPE has the model autoregressively generate visual semantic tokens such as SigLIP 2 as visual prompts that capture semantic layout, then generate full image tokens conditioned on this plan inside the same unified architecture. This internal step enables joint access to input and plan, unlike external two-stage pipelines that feed semantic tokens from one model to an independent decoder.

What carries the argument

Internal autoregressive generation of visual semantic tokens as prompts that guide subsequent image token generation within one unified model.

If this is right

Accelerates convergence during training across class-conditional, text-to-image, and editing setups.
Raises quality ceilings for generated images while using the same model scale.
Delivers higher editing preservation measured at PSNR 26.76 compared with 19.92 for external alternatives.
Maintains competitive editing responsiveness without extra parameters.
Applies to various token types and both autoregressive and diffusion-based internal architectures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Single-model systems could handle planning and rendering stages more efficiently by reusing internal representations rather than passing data between separate components.
Similar internal prompting might extend to video generation where temporal consistency requires joint access to semantic plans and pixel output.
The method could reduce the parameter overhead of maintaining parallel external planners for downstream editing workflows.

Load-bearing premise

Integrating semantic token generation inside the same model allows the image decoder to jointly access both the original input and the semantic plan.

What would settle it

An external two-stage system reaching PSNR 26 or higher in editing tasks at the same parameter scale would undermine the claimed benefit of internal integration.

Figures

Figures reproduced from arXiv: 2606.04457 by Aojun Zhou, Fengda Zhang, Hanwang Zhang, Jiachun Pan, Kesen Zhao, Liyu Jia, Saining Zhang, Wang Lin, Weijia Wu, Yue Liao.

**Figure 1.** Figure 1: VPE in two architectural frameworks. VPE inserts SigLIP 2 visual prompts before image generation as a semantic plan. (a) In the internal architecture, all modalities (text, SigLIP 2 tokens, image tokens) share attention within a single model via Eq. 4, enabling direct cross-referencing during generation. (b) In the external architecture, an AR model first produces SigLIP 2 features, which are then routed t… view at source ↗

**Figure 2.** Figure 2: Convergence on ImageNet. (a) LlamaGen FID-50K (512×512): VPE starts higher due to progressive masking (pmask=0.95) but surpasses the baseline from epoch 50, with the gap widening to 15.87 by epoch 100. (b) LlamaGen training loss converges to a lower value with VPE. (c) Transfusion FID-50K (256×256): both branches fork from the shared checkpoint (dashed line). The initial high FID of VPE at 150k reflects th… view at source ↗

**Figure 3.** Figure 3: Text rendering comparison. Show-o2+VPE generates more accurate text compared to Show-o2∗ , consistent with the quantitative improvements in [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Editing comparison. Each triplet shows Reference / Internal+VPE / External+VPE. Internal+VPE preserves unedited regions while External+VPE regenerates inconsistent backgrounds. General T2I results [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Training loss curves. Top row: VPE vs. baseline training loss for each framework. (a) LlamaGen+VPE converges to lower loss than LlamaGen. (b) Transfusion+VPE converges to lower flow loss than Transfusion after forking from the shared base. (c) Show-o2+VPE achieves lower flow loss than Show-o2∗ on text rendering. Bottom row: Internal vs. External architecture training. (d) Stage-1 SigLIP 2 alignment. (e) St… view at source ↗

read the original abstract

Incorporating visual semantic representations as an intermediate step before image generation can reduce the modeling difficulty between text and images, thereby improving generation quality. Recent works such as X-Omni and BLIP3o-Next have explored this direction, but they typically use a two-stage external pipeline: a separate autoregressive model first generates semantic tokens, which are then fed as conditioning to an independent diffusion decoder. Since the decoder cannot jointly access the original input and the semantic plan, this design introduces an information bottleneck that limits detail preservation in downstream tasks such as editing. Internal architectures such as Transfusion, BAGEL, and Show-o2 avoid this bottleneck by enabling cross-modal interaction within a single model, but they still face the difficult text-to-pixel modeling gap without intermediate semantic guidance. We propose Visual Prompt Engineering (VPE), which can be seamlessly integrated into such internal frameworks. Specifically, the model first autoregressively generates visual semantic tokens (e.g., SigLIP 2) as "visual prompts" that capture the semantic layout, then generates the full image tokens conditioned on this plan. We validate VPE across class-conditional generation, text-to-image generation, and image editing, covering various token types and model architectures. Results show that VPE can accelerate convergence, raise quality ceilings, and through internal integration, achieve substantially better editing preservation (PSNR: 26.76 vs. 19.92) than external alternatives of the same parameter scale, while maintaining competitive editing responsiveness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

VPE puts semantic token generation inside unified models to fix the external pipeline bottleneck, but the reported editing gains need tighter controls on training and capacity.

read the letter

The key point is that VPE adds an internal autoregressive step for visual semantic tokens before image generation in unified models, claiming better detail preservation in editing than external pipelines.

This approach is new because it places the semantic planning inside the same model rather than using a separate autoregressive model followed by a diffusion decoder. The paper argues that external setups create an information bottleneck since the decoder cannot access the original input alongside the semantic plan. Internal models avoid this but lack the intermediate guidance, so VPE combines both by generating SigLIP-like tokens first as prompts, then conditioning the image tokens on them.

It does well at identifying a practical problem in current multimodal architectures and proposing a seamless integration. The results cover multiple tasks and show faster convergence plus higher quality, with a notable PSNR improvement in editing from 19.92 to 26.76 against same-scale external methods.

The soft spot is the lack of isolation in the comparison. The abstract attributes the gain to joint access, but it is unclear whether the baselines match on training compute, optimization, or if the internal model has additional capacity allocated to semantic token prediction. Without those controls, the delta might reflect differences in how the models are trained rather than the architectural change alone. The soundness is limited by the absence of methods details in what we have.

This paper is aimed at people working on unified text-image models for generation and editing. A reader focused on controllable generation will get value from the framing and the empirical direction, even if the experiments require closer look.

I would recommend sending it for peer review so the full experimental design can be evaluated.

Referee Report

2 major / 2 minor

Summary. The paper proposes Visual Prompt Engineering (VPE), which integrates autoregressive generation of visual semantic tokens (e.g., SigLIP 2) as internal 'visual prompts' before full image token generation within unified models such as Transfusion or Show-o2. This is positioned as avoiding the information bottleneck of external two-stage pipelines (e.g., X-Omni) while providing semantic guidance absent in pure internal architectures. Validation is claimed across class-conditional generation, text-to-image, and image editing, with a highlighted quantitative result of improved editing preservation (PSNR 26.76 vs. 19.92) at the same parameter scale and competitive responsiveness.

Significance. If the reported PSNR gains and convergence improvements hold under matched experimental conditions, the work would demonstrate a practical internal mechanism for semantic planning that improves detail preservation in editing without external pipelines. The empirical coverage across tasks and token types provides a useful data point for unified multimodal architectures.

major comments (2)

[Abstract] Abstract: The central claim that 'through internal integration, VPE achieves substantially better editing preservation (PSNR: 26.76 vs. 19.92) than external alternatives of the same parameter scale' is load-bearing for the paper's contribution, yet the manuscript provides no details on whether the external baselines match total parameters (including any extra capacity allocated to VPE's autoregressive semantic head), training compute, data, or optimization procedure. This leaves open the possibility that the delta arises from capacity or end-to-end training differences rather than joint access to input and semantic plan.
[Abstract] Abstract (and results sections): The assumption that the image decoder jointly accesses the original input and semantic plan to avoid the external bottleneck is presented as the mechanism, but no ablation or controlled comparison isolates this factor from unified training or architectural differences. Without such isolation (e.g., via matched two-stage internal variants), the attribution remains unverified.

minor comments (2)

[Abstract] The abstract mentions validation 'covering various token types and model architectures' but does not name the specific internal frameworks, tokenizers, or additional metrics used beyond PSNR; adding these would improve reproducibility.
[Abstract] Presentation of quantitative results would benefit from error bars, number of runs, or baseline implementation details to allow assessment of the reported PSNR values.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below with clarifications and indicate where revisions will be made to improve transparency and rigor.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that 'through internal integration, VPE achieves substantially better editing preservation (PSNR: 26.76 vs. 19.92) than external alternatives of the same parameter scale' is load-bearing for the paper's contribution, yet the manuscript provides no details on whether the external baselines match total parameters (including any extra capacity allocated to VPE's autoregressive semantic head), training compute, data, or optimization procedure. This leaves open the possibility that the delta arises from capacity or end-to-end training differences rather than joint access to input and semantic plan.

Authors: We appreciate the referee's emphasis on ensuring fair comparisons. The external baselines such as X-Omni consist of a separate autoregressive semantic model plus an independent decoder, and our reported comparisons use models with matched total parameter counts (VPE's unified architecture allocates capacity to the semantic head internally within the same budget). To eliminate any ambiguity, the revised manuscript will include an explicit table in the appendix detailing parameter counts, training compute, datasets, and optimization settings for VPE and all baselines. revision: yes
Referee: [Abstract] Abstract (and results sections): The assumption that the image decoder jointly accesses the original input and semantic plan to avoid the external bottleneck is presented as the mechanism, but no ablation or controlled comparison isolates this factor from unified training or architectural differences. Without such isolation (e.g., via matched two-stage internal variants), the attribution remains unverified.

Authors: In unified models such as Transfusion and Show-o2, the image token generation stage operates within a single transformer that can attend jointly to the original conditioning (text or image) and the autoregressively generated semantic tokens via cross-attention and shared layers. This joint access is architecturally unavailable in external two-stage pipelines. We agree that an explicit ablation against a matched internal two-stage variant would further isolate the effect; however, training such variants requires substantial additional compute. In the revision we will expand the discussion and method sections to provide a more detailed architectural analysis of the joint-access mechanism and its relation to the observed PSNR gains. revision: partial

Circularity Check

0 steps flagged

No circularity: architectural proposal with empirical validation only

full rationale

The paper proposes Visual Prompt Engineering (VPE) as an internal architectural integration into existing frameworks like Transfusion or Show-o2, where the model autoregressively generates semantic tokens before image tokens. No equations, derivations, fitted parameters, or mathematical claims are present in the text. The central claim of improved editing preservation (PSNR 26.76 vs 19.92) is presented as an empirical outcome of joint access within one model, without any reduction to self-citations, ansatzes, or renamed known results. The comparison to external pipelines is stated directly as a design difference, not derived from prior author work or by construction. This is a standard non-circular empirical architecture paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review based on abstract only; no model equations, training objectives, or parameter counts are visible, so ledger entries are limited to the core domain assumption stated in the text.

axioms (1)

domain assumption Visual semantic tokens (e.g., from SigLIP 2) capture sufficient semantic layout to serve as effective conditioning for full image generation.
Invoked as the basis for the visual prompt step that reduces modeling difficulty.

pith-pipeline@v0.9.1-grok · 5828 in / 1240 out tokens · 27279 ms · 2026-06-28T07:10:30.171859+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

50 extracted references · 16 linked inside Pith

[1]

Show-o: One single trans- former to unify multimodal understanding and generation.arXiv preprint arXiv:2408.12528, 2024

Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single trans- former to unify multimodal understanding and generation.arXiv preprint arXiv:2408.12528, 2024

Pith/arXiv arXiv 2024
[2]

Show-o2: Improved native unified multimodal models.arXiv preprint arXiv:2506.15564, 2025

Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models.arXiv preprint arXiv:2506.15564, 2025

Pith/arXiv arXiv 2025
[3]

Emu3: Next-token prediction is all you need

Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024

Pith/arXiv arXiv 2024
[4]

Transfusion: Predict the next token and diffuse images with one multi-modal model.arXiv preprint arXiv:2408.11039, 2024

Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model.arXiv preprint arXiv:2408.11039, 2024

Pith/arXiv arXiv 2024
[5]

Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

Pith/arXiv arXiv 2025
[6]

X-omni: Reinforcement learning makes discrete autoregressive image generative models great again.arXiv preprint arXiv:2507.22058, 2025

Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu, Xiaosong Zhang, et al. X-omni: Reinforcement learning makes discrete autoregressive image generative models great again.arXiv preprint arXiv:2507.22058, 2025

arXiv 2025
[7]

Blip3o-next: Next frontier of native image generation

Jiuhai Chen, Le Xue, Zhiyang Xu, Xichen Pan, Shusheng Yang, Can Qin, An Yan, Honglu Zhou, Zeyuan Chen, Lifu Huang, et al. Blip3o-next: Next frontier of native image generation. arXiv preprint arXiv:2510.15857, 2025

arXiv 2025
[8]

Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alab- dulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025

Pith/arXiv arXiv 2025
[9]

Neural discrete representation learning.Advances in neural information processing systems, 30, 2017

Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning.Advances in neural information processing systems, 30, 2017

2017
[10]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

2022
[11]

Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

Pith/arXiv arXiv 2022
[12]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

2023
[13]

Cogview: Mastering text-to-image generation via transformers.Advances in neural information processing systems, 34:19822–19835, 2021

Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. Cogview: Mastering text-to-image generation via transformers.Advances in neural information processing systems, 34:19822–19835, 2021

2021
[14]

Scaling autoregressive models for content-rich text-to-image generation.arXiv preprint arXiv:2206.10789, 2(3):5, 2022

Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation.arXiv preprint arXiv:2206.10789, 2(3):5, 2022

Pith/arXiv arXiv 2022
[15]

Autoregressive model beats diffusion: Llama for scalable image generation.arXiv preprint arXiv:2406.06525, 2024

Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation.arXiv preprint arXiv:2406.06525, 2024

Pith/arXiv arXiv 2024
[16]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021. 10

2021
[17]

Visual autoregressive modeling: Scalable image generation via next-scale prediction.Advances in neural information processing systems, 37:84839–84865, 2024

Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction.Advances in neural information processing systems, 37:84839–84865, 2024

2024
[18]

Autoregressive image generation without vector quantization.Advances in Neural Information Processing Systems, 37:56424–56445, 2024

Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization.Advances in Neural Information Processing Systems, 37:56424–56445, 2024

2024
[19]

Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis

Jian Han, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15733–15744, 2025

2025
[20]

Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

2020
[21]

Scaling rectified flow trans- formers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow trans- formers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024

2024
[22]

Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers

Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. InEuropean Conference on Computer Vision, pages 23–40. Springer, 2024

2024
[23]

Photorealistic text-to-image diffusion models with deep language understanding.Advances in neural information processing systems, 35:36479–36494, 2022

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding.Advances in neural information processing systems, 35:36479–36494, 2022

2022
[24]

Pixart- σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation

Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart- σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation. InEuropean Conference on Computer Vision, pages 74–91. Springer, 2024

2024
[25]

Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

Pith/arXiv arXiv 2025
[26]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

2021
[27]

Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

Pith/arXiv arXiv 2023
[28]

Hierarchical text-conditional image generation with clip latents.arXiv preprint arXiv:2204.06125, 1(2):3, 2022

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents.arXiv preprint arXiv:2204.06125, 1(2):3, 2022

Pith/arXiv arXiv 2022
[29]

Representation alignment for generation: Training diffusion transformers is easier than you think.arXiv preprint arXiv:2410.06940, 2024

Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think.arXiv preprint arXiv:2410.06940, 2024

Pith/arXiv arXiv 2024
[30]

Instructpix2pix: Learning to follow image editing instructions

Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392–18402, 2023

2023
[31]

Magicbrush: A manually annotated dataset for instruction-guided image editing.Advances in Neural Information Processing Systems, 36:31428–31449, 2023

Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. Magicbrush: A manually annotated dataset for instruction-guided image editing.Advances in Neural Information Processing Systems, 36:31428–31449, 2023. 11

2023
[32]

Seed-x: Multimodal models with unified multi-granularity comprehension and generation.arXiv preprint arXiv:2404.14396, 2024

Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, and Ying Shan. Seed-x: Multimodal models with unified multi-granularity comprehension and generation.arXiv preprint arXiv:2404.14396, 2024

Pith/arXiv arXiv 2024
[33]

Editar: Unified conditional generation with autoregressive models

Jiteng Mu, Nuno Vasconcelos, and Xiaolong Wang. Editar: Unified conditional generation with autoregressive models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 7899–7909, 2025

2025
[34]

Prompt-to-prompt image editing with cross attention control.arXiv preprint arXiv:2208.01626, 2022

Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control.arXiv preprint arXiv:2208.01626, 2022

Pith/arXiv arXiv 2022
[35]

Null-text inversion for editing real images using guided diffusion models

Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6038–6047, 2023

2023
[36]

Direct inversion: Boosting diffusion-based editing with 3 lines of code.arXiv preprint arXiv:2310.01506, 2023

Xuan Ju, Ailing Zeng, Yuxuan Bian, Shaoteng Liu, and Qiang Xu. Direct inversion: Boosting diffusion-based editing with 3 lines of code.arXiv preprint arXiv:2310.01506, 2023

arXiv 2023
[37]

Unicedit-10m: A dataset and benchmark breaking the scale-quality barrier via unified verification for reasoning-enriched edits.arXiv preprint arXiv:2512.02790, 2025

Keming Ye, Zhipeng Huang, Canmiao Fu, Qingyang Liu, Jiani Cai, Zheqi Lv, Chen Li, Jing Lyu, Zhou Zhao, and Shengyu Zhang. Unicedit-10m: A dataset and benchmark breaking the scale-quality barrier via unified verification for reasoning-enriched edits.arXiv preprint arXiv:2512.02790, 2025

arXiv 2025
[38]

Pico-banana-400k: A large-scale dataset for text-guided image editing.arXiv preprint arXiv:2510.19808, 2025

Yusu Qian, Eli Bocek-Rivele, Liangchen Song, Jialing Tong, Yinfei Yang, Jiasen Lu, Wenze Hu, and Zhe Gan. Pico-banana-400k: A large-scale dataset for text-guided image editing.arXiv preprint arXiv:2510.19808, 2025

arXiv 2025
[39]

Smartedit: Exploring complex instruction- based image editing with multimodal large language models

Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, et al. Smartedit: Exploring complex instruction- based image editing with multimodal large language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8362–8371, 2024

2024
[40]

Imagenet: A large- scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large- scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009

2009
[41]

Improving image generation with better captions

James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn. openai. com/papers/dall-e-3. pdf, 2(3):8, 2023

2023
[42]

Anytext: Multilingual visual text generation and editing.arXiv preprint arXiv:2311.03054, 2023

Yuxiang Tuo, Wangmeng Xiang, Jun-Yan He, Yifeng Geng, and Xuansong Xie. Anytext: Multilingual visual text generation and editing.arXiv preprint arXiv:2311.03054, 2023

arXiv 2023
[43]

Textdiffuser-2: Unleashing the power of language models for text rendering

Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. Textdiffuser-2: Unleashing the power of language models for text rendering. InEuropean Conference on Computer Vision, pages 386–402. Springer, 2024

2024
[44]

Illume: Illuminating your llms to see, draw, and self-enhance

Chunwei Wang, Guansong Lu, Junwei Yang, Runhui Huang, Jianhua Han, Lu Hou, Wei Zhang, and Hang Xu. Illume: Illuminating your llms to see, draw, and self-enhance. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 21612–21622, 2025

2025
[45]

Dual diffusion for unified image generation and understanding

Zijie Li, Henry Li, Yichun Shi, Amir Barati Farimani, Yuval Kluger, Linjie Yang, and Peng Wang. Dual diffusion for unified image generation and understanding. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 2779–2790, 2025

2025
[46]

Transfer between modalities with metaqueries.arXiv preprint arXiv:2504.06256, 2025

Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, et al. Transfer between modalities with metaqueries.arXiv preprint arXiv:2504.06256, 2025

Pith/arXiv arXiv 2025
[47]

Textatlas5m: A large-scale dataset for dense text image generation.arXiv preprint arXiv:2502.07870, 2025

Alex Jinpeng Wang, Dongxing Mao, Jiawei Zhang, Weiming Han, Zhuobai Dong, Linjie Li, Yiqi Lin, Zhengyuan Yang, Libo Qin, Fuwei Zhang, et al. Textatlas5m: A large-scale dataset for dense text image generation.arXiv preprint arXiv:2502.07870, 2025. 12

arXiv 2025
[48]

Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

2023
[49]

Nohumansrequired: Autonomous high-quality image editing triplet mining

Maksim Kuprashevich, Grigorii Alekseenko, Irina Tolstykh, Georgii Fedorov, Bulat Suleimanov, Vladimir Dokholyan, and Aleksandr Gordeev. Nohumansrequired: Autonomous high-quality image editing triplet mining. InProceedings of the IEEE/CVF Winter Conference on Applica- tions of Computer Vision, pages 6059–6068, 2026

2026
[50]

CE only” indicates that only cross-entropy loss on SigLIP 2 tokens is used (no flow matching loss). “Flow only

Junying Chen, Zhenyang Cai, Pengcheng Chen, Shunian Chen, Ke Ji, Xidong Wang, Yunjin Yang, and Benyou Wang. Sharegpt-4o-image: Aligning multimodal models with gpt-4o-level image generation.arXiv preprint arXiv:2506.18095, 2025. 13 A Full Class-to-Image Results Tables 6 and 7 report the complete evaluation metrics at more checkpoints for the class-to-image...

arXiv 2025

[1] [1]

Show-o: One single trans- former to unify multimodal understanding and generation.arXiv preprint arXiv:2408.12528, 2024

Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single trans- former to unify multimodal understanding and generation.arXiv preprint arXiv:2408.12528, 2024

Pith/arXiv arXiv 2024

[2] [2]

Show-o2: Improved native unified multimodal models.arXiv preprint arXiv:2506.15564, 2025

Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models.arXiv preprint arXiv:2506.15564, 2025

Pith/arXiv arXiv 2025

[3] [3]

Emu3: Next-token prediction is all you need

Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024

Pith/arXiv arXiv 2024

[4] [4]

Transfusion: Predict the next token and diffuse images with one multi-modal model.arXiv preprint arXiv:2408.11039, 2024

Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model.arXiv preprint arXiv:2408.11039, 2024

Pith/arXiv arXiv 2024

[5] [5]

Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

Pith/arXiv arXiv 2025

[6] [6]

X-omni: Reinforcement learning makes discrete autoregressive image generative models great again.arXiv preprint arXiv:2507.22058, 2025

Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu, Xiaosong Zhang, et al. X-omni: Reinforcement learning makes discrete autoregressive image generative models great again.arXiv preprint arXiv:2507.22058, 2025

arXiv 2025

[7] [7]

Blip3o-next: Next frontier of native image generation

Jiuhai Chen, Le Xue, Zhiyang Xu, Xichen Pan, Shusheng Yang, Can Qin, An Yan, Honglu Zhou, Zeyuan Chen, Lifu Huang, et al. Blip3o-next: Next frontier of native image generation. arXiv preprint arXiv:2510.15857, 2025

arXiv 2025

[8] [8]

Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alab- dulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025

Pith/arXiv arXiv 2025

[9] [9]

Neural discrete representation learning.Advances in neural information processing systems, 30, 2017

Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning.Advances in neural information processing systems, 30, 2017

2017

[10] [10]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

2022

[11] [11]

Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

Pith/arXiv arXiv 2022

[12] [12]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

2023

[13] [13]

Cogview: Mastering text-to-image generation via transformers.Advances in neural information processing systems, 34:19822–19835, 2021

Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. Cogview: Mastering text-to-image generation via transformers.Advances in neural information processing systems, 34:19822–19835, 2021

2021

[14] [14]

Scaling autoregressive models for content-rich text-to-image generation.arXiv preprint arXiv:2206.10789, 2(3):5, 2022

Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation.arXiv preprint arXiv:2206.10789, 2(3):5, 2022

Pith/arXiv arXiv 2022

[15] [15]

Autoregressive model beats diffusion: Llama for scalable image generation.arXiv preprint arXiv:2406.06525, 2024

Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation.arXiv preprint arXiv:2406.06525, 2024

Pith/arXiv arXiv 2024

[16] [16]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021. 10

2021

[17] [17]

Visual autoregressive modeling: Scalable image generation via next-scale prediction.Advances in neural information processing systems, 37:84839–84865, 2024

Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction.Advances in neural information processing systems, 37:84839–84865, 2024

2024

[18] [18]

Autoregressive image generation without vector quantization.Advances in Neural Information Processing Systems, 37:56424–56445, 2024

Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization.Advances in Neural Information Processing Systems, 37:56424–56445, 2024

2024

[19] [19]

Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis

Jian Han, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15733–15744, 2025

2025

[20] [20]

Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

2020

[21] [21]

Scaling rectified flow trans- formers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow trans- formers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024

2024

[22] [22]

Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers

Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. InEuropean Conference on Computer Vision, pages 23–40. Springer, 2024

2024

[23] [23]

Photorealistic text-to-image diffusion models with deep language understanding.Advances in neural information processing systems, 35:36479–36494, 2022

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding.Advances in neural information processing systems, 35:36479–36494, 2022

2022

[24] [24]

Pixart- σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation

Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart- σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation. InEuropean Conference on Computer Vision, pages 74–91. Springer, 2024

2024

[25] [25]

Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

Pith/arXiv arXiv 2025

[26] [26]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

2021

[27] [27]

Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

Pith/arXiv arXiv 2023

[28] [28]

Hierarchical text-conditional image generation with clip latents.arXiv preprint arXiv:2204.06125, 1(2):3, 2022

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents.arXiv preprint arXiv:2204.06125, 1(2):3, 2022

Pith/arXiv arXiv 2022

[29] [29]

Representation alignment for generation: Training diffusion transformers is easier than you think.arXiv preprint arXiv:2410.06940, 2024

Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think.arXiv preprint arXiv:2410.06940, 2024

Pith/arXiv arXiv 2024

[30] [30]

Instructpix2pix: Learning to follow image editing instructions

Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392–18402, 2023

2023

[31] [31]

Magicbrush: A manually annotated dataset for instruction-guided image editing.Advances in Neural Information Processing Systems, 36:31428–31449, 2023

Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. Magicbrush: A manually annotated dataset for instruction-guided image editing.Advances in Neural Information Processing Systems, 36:31428–31449, 2023. 11

2023

[32] [32]

Seed-x: Multimodal models with unified multi-granularity comprehension and generation.arXiv preprint arXiv:2404.14396, 2024

Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, and Ying Shan. Seed-x: Multimodal models with unified multi-granularity comprehension and generation.arXiv preprint arXiv:2404.14396, 2024

Pith/arXiv arXiv 2024

[33] [33]

Editar: Unified conditional generation with autoregressive models

Jiteng Mu, Nuno Vasconcelos, and Xiaolong Wang. Editar: Unified conditional generation with autoregressive models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 7899–7909, 2025

2025

[34] [34]

Prompt-to-prompt image editing with cross attention control.arXiv preprint arXiv:2208.01626, 2022

Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control.arXiv preprint arXiv:2208.01626, 2022

Pith/arXiv arXiv 2022

[35] [35]

Null-text inversion for editing real images using guided diffusion models

Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6038–6047, 2023

2023

[36] [36]

Direct inversion: Boosting diffusion-based editing with 3 lines of code.arXiv preprint arXiv:2310.01506, 2023

Xuan Ju, Ailing Zeng, Yuxuan Bian, Shaoteng Liu, and Qiang Xu. Direct inversion: Boosting diffusion-based editing with 3 lines of code.arXiv preprint arXiv:2310.01506, 2023

arXiv 2023

[37] [37]

Unicedit-10m: A dataset and benchmark breaking the scale-quality barrier via unified verification for reasoning-enriched edits.arXiv preprint arXiv:2512.02790, 2025

Keming Ye, Zhipeng Huang, Canmiao Fu, Qingyang Liu, Jiani Cai, Zheqi Lv, Chen Li, Jing Lyu, Zhou Zhao, and Shengyu Zhang. Unicedit-10m: A dataset and benchmark breaking the scale-quality barrier via unified verification for reasoning-enriched edits.arXiv preprint arXiv:2512.02790, 2025

arXiv 2025

[38] [38]

Pico-banana-400k: A large-scale dataset for text-guided image editing.arXiv preprint arXiv:2510.19808, 2025

Yusu Qian, Eli Bocek-Rivele, Liangchen Song, Jialing Tong, Yinfei Yang, Jiasen Lu, Wenze Hu, and Zhe Gan. Pico-banana-400k: A large-scale dataset for text-guided image editing.arXiv preprint arXiv:2510.19808, 2025

arXiv 2025

[39] [39]

Smartedit: Exploring complex instruction- based image editing with multimodal large language models

Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, et al. Smartedit: Exploring complex instruction- based image editing with multimodal large language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8362–8371, 2024

2024

[40] [40]

Imagenet: A large- scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large- scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009

2009

[41] [41]

Improving image generation with better captions

James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn. openai. com/papers/dall-e-3. pdf, 2(3):8, 2023

2023

[42] [42]

Anytext: Multilingual visual text generation and editing.arXiv preprint arXiv:2311.03054, 2023

Yuxiang Tuo, Wangmeng Xiang, Jun-Yan He, Yifeng Geng, and Xuansong Xie. Anytext: Multilingual visual text generation and editing.arXiv preprint arXiv:2311.03054, 2023

arXiv 2023

[43] [43]

Textdiffuser-2: Unleashing the power of language models for text rendering

Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. Textdiffuser-2: Unleashing the power of language models for text rendering. InEuropean Conference on Computer Vision, pages 386–402. Springer, 2024

2024

[44] [44]

Illume: Illuminating your llms to see, draw, and self-enhance

Chunwei Wang, Guansong Lu, Junwei Yang, Runhui Huang, Jianhua Han, Lu Hou, Wei Zhang, and Hang Xu. Illume: Illuminating your llms to see, draw, and self-enhance. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 21612–21622, 2025

2025

[45] [45]

Dual diffusion for unified image generation and understanding

Zijie Li, Henry Li, Yichun Shi, Amir Barati Farimani, Yuval Kluger, Linjie Yang, and Peng Wang. Dual diffusion for unified image generation and understanding. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 2779–2790, 2025

2025

[46] [46]

Transfer between modalities with metaqueries.arXiv preprint arXiv:2504.06256, 2025

Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, et al. Transfer between modalities with metaqueries.arXiv preprint arXiv:2504.06256, 2025

Pith/arXiv arXiv 2025

[47] [47]

Textatlas5m: A large-scale dataset for dense text image generation.arXiv preprint arXiv:2502.07870, 2025

Alex Jinpeng Wang, Dongxing Mao, Jiawei Zhang, Weiming Han, Zhuobai Dong, Linjie Li, Yiqi Lin, Zhengyuan Yang, Libo Qin, Fuwei Zhang, et al. Textatlas5m: A large-scale dataset for dense text image generation.arXiv preprint arXiv:2502.07870, 2025. 12

arXiv 2025

[48] [48]

Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

2023

[49] [49]

Nohumansrequired: Autonomous high-quality image editing triplet mining

Maksim Kuprashevich, Grigorii Alekseenko, Irina Tolstykh, Georgii Fedorov, Bulat Suleimanov, Vladimir Dokholyan, and Aleksandr Gordeev. Nohumansrequired: Autonomous high-quality image editing triplet mining. InProceedings of the IEEE/CVF Winter Conference on Applica- tions of Computer Vision, pages 6059–6068, 2026

2026

[50] [50]

CE only” indicates that only cross-entropy loss on SigLIP 2 tokens is used (no flow matching loss). “Flow only

Junying Chen, Zhenyang Cai, Pengcheng Chen, Shunian Chen, Ke Ji, Xidong Wang, Yunjin Yang, and Benyou Wang. Sharegpt-4o-image: Aligning multimodal models with gpt-4o-level image generation.arXiv preprint arXiv:2506.18095, 2025. 13 A Full Class-to-Image Results Tables 6 and 7 report the complete evaluation metrics at more checkpoints for the class-to-image...

arXiv 2025