
arxiv: 2606.04457 · v1 · pith:DKPDOY7Onew · submitted 2026-06-03 · 💻 cs.CV

Imagine Before You Draw: Visual Prompt Engineering for Image Generation

Pith reviewed 2026-06-28 07:10 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual prompt engineering · semantic tokens · image generation · text-to-image · image editing · autoregressive models · diffusion models

The pith

Integrating visual semantic token generation inside image models avoids external pipeline bottlenecks and improves editing preservation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Visual Prompt Engineering to let a single model first autoregressively produce visual semantic tokens that outline semantic layout, then generate image tokens conditioned on that internal plan. This design targets the information bottleneck created when separate models handle semantic planning and image decoding in sequence. By keeping both the original input and the semantic plan accessible to the decoder at once, the approach seeks to ease the text-to-image modeling gap while preserving finer details. Experiments across generation and editing tasks show faster convergence and higher ceilings, with the largest gains appearing in editing where external methods lose information.

Core claim

VPE has the model autoregressively generate visual semantic tokens (e.g., SigLIP 2 tokens) as visual prompts that capture semantic layout, then generate the full image tokens conditioned on this plan inside the same unified architecture. This internal step gives the image-generation stage joint access to the input and the plan, unlike external two-stage pipelines that feed semantic tokens from one model to an independent decoder.
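
The two-phase order described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `toy_next_token` is a hypothetical stand-in for a transformer forward pass plus sampling, the vocabulary sizes and token counts are arbitrary, and the loop covers only the token-by-token autoregressive case (the diffusion-based variants named in the abstract would replace the second loop with a denoising procedure).

```python
import random
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sequence:
    prompt: List[int]                                   # text or class tokens
    semantic: List[int] = field(default_factory=list)   # the visual prompt
    image: List[int] = field(default_factory=list)      # final image tokens


def toy_next_token(context, vocab_size, rng):
    """Hypothetical stand-in for a model forward pass plus sampling."""
    # Mix the visible context into the draw so the output depends on it.
    return (sum(context) + rng.randrange(vocab_size)) % vocab_size


def generate_with_visual_prompt(prompt, n_semantic, n_image, seed=0,
                                semantic_vocab=1024, image_vocab=4096):
    rng = random.Random(seed)
    seq = Sequence(prompt=list(prompt))
    # Stage 1: emit the semantic plan, conditioned on the prompt.
    for _ in range(n_semantic):
        context = seq.prompt + seq.semantic
        seq.semantic.append(toy_next_token(context, semantic_vocab, rng))
    # Stage 2: emit image tokens; the context holds the prompt AND the plan.
    for _ in range(n_image):
        context = seq.prompt + seq.semantic + seq.image
        seq.image.append(toy_next_token(context, image_vocab, rng))
    return seq


seq = generate_with_visual_prompt(prompt=[5, 17, 42], n_semantic=4, n_image=8)
print(len(seq.semantic), len(seq.image))
```

The line to look at is the `context` assignment in the second loop: the prompt and the semantic plan are both still present while image tokens are produced, which is the property the core claim rests on.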

What carries the argument

Internal autoregressive generation of visual semantic tokens as prompts that guide subsequent image token generation within one unified model.

If this is right

  • Accelerates convergence during training across class-conditional, text-to-image, and editing setups.
  • Raises quality ceilings for generated images while using the same model scale.
  • Delivers higher editing preservation: PSNR 26.76 versus 19.92 for external alternatives of the same parameter scale.
  • Maintains competitive editing responsiveness without extra parameters.
  • Applies to various token types and both autoregressive and diffusion-based internal architectures.
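
To put the PSNR figures in perspective, the standard definition can be inverted to compare mean squared errors. The sketch below assumes the usual formula, PSNR = 10 · log10(MAX² / MSE), with an 8-bit peak value of 255 as the default; the abstract does not state the peak value or the region over which PSNR is computed, so those are assumptions. The function names are illustrative.

```python
import math


def psnr(reference, edited, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(edited):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((r - e) ** 2 for r, e in zip(reference, edited)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio(psnr_high_db, psnr_low_db):
    """How many times larger the MSE is at the lower PSNR (same peak value)."""
    return 10.0 ** ((psnr_high_db - psnr_low_db) / 10.0)


# The two PSNR values quoted in the abstract.
ratio = mse_ratio(26.76, 19.92)
print(round(ratio, 2))
```

Under these assumptions the 6.84 dB gap corresponds to the external pipeline having roughly 4.8 times the mean squared error against the reference.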

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Single-model systems could handle planning and rendering stages more efficiently by reusing internal representations rather than passing data between separate components.
  • Similar internal prompting might extend to video generation where temporal consistency requires joint access to semantic plans and pixel output.
  • The method could reduce the parameter overhead of maintaining parallel external planners for downstream editing workflows.

Load-bearing premise

Integrating semantic token generation inside the same model allows the image decoder to jointly access both the original input and the semantic plan.
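
One way to picture the premise is as a table of which segment may read which. The sketch below is an illustration under stated assumptions: the segment names and the two rule sets are hypothetical and encode only what the abstract says, namely that the external decoder is conditioned on the semantic tokens while the internal model keeps the original input visible during image generation. It does not reproduce the paper's attention equations.

```python
def visibility(segments, rules):
    """Return {query_segment: sorted list of segments it may read}."""
    return {q: sorted(k for k in segments if (q, k) in rules) for q in segments}


SEGMENTS = ["input", "plan", "image"]

# Internal (single model): image positions read the input and the plan.
INTERNAL = {
    ("input", "input"),
    ("plan", "input"), ("plan", "plan"),
    ("image", "input"), ("image", "plan"), ("image", "image"),
}

# External (two models): the decoder is conditioned on the plan alone.
EXTERNAL = {
    ("input", "input"),
    ("plan", "input"), ("plan", "plan"),
    ("image", "plan"), ("image", "image"),
}

internal = visibility(SEGMENTS, INTERNAL)
external = visibility(SEGMENTS, EXTERNAL)

# What the image stage loses when the pipeline is split in two.
missing = set(internal["image"]) - set(external["image"])
print(missing)  # prints {'input'}
```

In an editing task the `input` segment includes the reference image, so this single missing entry is where the claimed information bottleneck would arise.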

What would settle it

An external two-stage system reaching PSNR 26 or higher in editing tasks at the same parameter scale would undermine the claimed benefit of internal integration.

Figures

Figures reproduced from arXiv: 2606.04457 by Aojun Zhou, Fengda Zhang, Hanwang Zhang, Jiachun Pan, Kesen Zhao, Liyu Jia, Saining Zhang, Wang Lin, Weijia Wu, Yue Liao.

Figure 1: VPE in two architectural frameworks. VPE inserts SigLIP 2 visual prompts before image generation as a semantic plan. (a) In the internal architecture, all modalities (text, SigLIP 2 tokens, image tokens) share attention within a single model via Eq. 4, enabling direct cross-referencing during generation. (b) In the external architecture, an AR model first produces SigLIP 2 features, which are then routed t…

Figure 2: Convergence on ImageNet. (a) LlamaGen FID-50K (512×512): VPE starts higher due to progressive masking (pmask=0.95) but surpasses the baseline from epoch 50, with the gap widening to 15.87 by epoch 100. (b) LlamaGen training loss converges to a lower value with VPE. (c) Transfusion FID-50K (256×256): both branches fork from the shared checkpoint (dashed line). The initial high FID of VPE at 150k reflects th…

Figure 3: Text rendering comparison. Show-o2+VPE generates more accurate text compared to Show-o2∗, consistent with the quantitative improvements in…

Figure 4: Editing comparison. Each triplet shows Reference / Internal+VPE / External+VPE. Internal+VPE preserves unedited regions while External+VPE regenerates inconsistent backgrounds.

Figure 5: Training loss curves. Top row: VPE vs. baseline training loss for each framework. (a) LlamaGen+VPE converges to lower loss than LlamaGen. (b) Transfusion+VPE converges to lower flow loss than Transfusion after forking from the shared base. (c) Show-o2+VPE achieves lower flow loss than Show-o2∗ on text rendering. Bottom row: Internal vs. External architecture training. (d) Stage-1 SigLIP 2 alignment. (e) St…

Original abstract

Incorporating visual semantic representations as an intermediate step before image generation can reduce the modeling difficulty between text and images, thereby improving generation quality. Recent works such as X-Omni and BLIP3o-Next have explored this direction, but they typically use a two-stage external pipeline: a separate autoregressive model first generates semantic tokens, which are then fed as conditioning to an independent diffusion decoder. Since the decoder cannot jointly access the original input and the semantic plan, this design introduces an information bottleneck that limits detail preservation in downstream tasks such as editing. Internal architectures such as Transfusion, BAGEL, and Show-o2 avoid this bottleneck by enabling cross-modal interaction within a single model, but they still face the difficult text-to-pixel modeling gap without intermediate semantic guidance. We propose Visual Prompt Engineering (VPE), which can be seamlessly integrated into such internal frameworks. Specifically, the model first autoregressively generates visual semantic tokens (e.g., SigLIP 2) as "visual prompts" that capture the semantic layout, then generates the full image tokens conditioned on this plan. We validate VPE across class-conditional generation, text-to-image generation, and image editing, covering various token types and model architectures. Results show that VPE can accelerate convergence, raise quality ceilings, and through internal integration, achieve substantially better editing preservation (PSNR: 26.76 vs. 19.92) than external alternatives of the same parameter scale, while maintaining competitive editing responsiveness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Visual Prompt Engineering (VPE), which integrates autoregressive generation of visual semantic tokens (e.g., SigLIP 2) as internal 'visual prompts' before full image token generation within unified models such as Transfusion or Show-o2. This is positioned as avoiding the information bottleneck of external two-stage pipelines (e.g., X-Omni) while providing semantic guidance absent in pure internal architectures. Validation is claimed across class-conditional generation, text-to-image, and image editing, with a highlighted quantitative result of improved editing preservation (PSNR 26.76 vs. 19.92) at the same parameter scale and competitive responsiveness.

Significance. If the reported PSNR gains and convergence improvements hold under matched experimental conditions, the work would demonstrate a practical internal mechanism for semantic planning that improves detail preservation in editing without external pipelines. The empirical coverage across tasks and token types provides a useful data point for unified multimodal architectures.

major comments (2)
  1. [Abstract] Abstract: The central claim that 'through internal integration, VPE achieves substantially better editing preservation (PSNR: 26.76 vs. 19.92) than external alternatives of the same parameter scale' is load-bearing for the paper's contribution, yet the manuscript provides no details on whether the external baselines match total parameters (including any extra capacity allocated to VPE's autoregressive semantic head), training compute, data, or optimization procedure. This leaves open the possibility that the delta arises from capacity or end-to-end training differences rather than joint access to input and semantic plan.
  2. [Abstract] Abstract (and results sections): The assumption that the image decoder jointly accesses the original input and semantic plan to avoid the external bottleneck is presented as the mechanism, but no ablation or controlled comparison isolates this factor from unified training or architectural differences. Without such isolation (e.g., via matched two-stage internal variants), the attribution remains unverified.
minor comments (2)
  1. [Abstract] The abstract mentions validation 'covering various token types and model architectures' but does not name the specific internal frameworks, tokenizers, or additional metrics used beyond PSNR; adding these would improve reproducibility.
  2. [Abstract] Presentation of quantitative results would benefit from error bars, number of runs, or baseline implementation details to allow assessment of the reported PSNR values.
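
The error bars requested in the second minor comment could be reported as mean ± standard error over repeated runs. The sketch below is a generic illustration using Python's standard library; the input values are placeholders chosen for easy arithmetic and are not results from the paper, and the helper name is hypothetical.

```python
import math
import statistics


def mean_and_stderr(values):
    """Mean and standard error of the mean over independent runs."""
    if len(values) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.fmean(values)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    stderr = statistics.stdev(values) / math.sqrt(len(values))
    return mean, stderr


# Placeholder numbers for illustration only; NOT results from the paper.
placeholder_runs = [1.0, 2.0, 3.0, 4.0, 5.0]
mean, stderr = mean_and_stderr(placeholder_runs)
print(f"{mean:.2f} ± {stderr:.2f}")
```

Reporting the number of runs alongside the interval lets a reader judge whether a gap such as the one between the two PSNR values exceeds run-to-run variation.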

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below with clarifications and indicate where revisions will be made to improve transparency and rigor.

point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that 'through internal integration, VPE achieves substantially better editing preservation (PSNR: 26.76 vs. 19.92) than external alternatives of the same parameter scale' is load-bearing for the paper's contribution, yet the manuscript provides no details on whether the external baselines match total parameters (including any extra capacity allocated to VPE's autoregressive semantic head), training compute, data, or optimization procedure. This leaves open the possibility that the delta arises from capacity or end-to-end training differences rather than joint access to input and semantic plan.

    Authors: We appreciate the referee's emphasis on ensuring fair comparisons. The external baselines such as X-Omni consist of a separate autoregressive semantic model plus an independent decoder, and our reported comparisons use models with matched total parameter counts (VPE's unified architecture allocates capacity to the semantic head internally within the same budget). To eliminate any ambiguity, the revised manuscript will include an explicit table in the appendix detailing parameter counts, training compute, datasets, and optimization settings for VPE and all baselines. revision: yes

  2. Referee: [Abstract] Abstract (and results sections): The assumption that the image decoder jointly accesses the original input and semantic plan to avoid the external bottleneck is presented as the mechanism, but no ablation or controlled comparison isolates this factor from unified training or architectural differences. Without such isolation (e.g., via matched two-stage internal variants), the attribution remains unverified.

    Authors: In unified models such as Transfusion and Show-o2, the image token generation stage operates within a single transformer that can attend jointly to the original conditioning (text or image) and the autoregressively generated semantic tokens via cross-attention and shared layers. This joint access is architecturally unavailable in external two-stage pipelines. We agree that an explicit ablation against a matched internal two-stage variant would further isolate the effect; however, training such variants requires substantial additional compute. In the revision we will expand the discussion and method sections to provide a more detailed architectural analysis of the joint-access mechanism and its relation to the observed PSNR gains. revision: partial

Circularity Check

0 steps flagged

No circularity: architectural proposal with empirical validation only

full rationale

The paper proposes Visual Prompt Engineering (VPE) as an internal architectural integration into existing frameworks like Transfusion or Show-o2, where the model autoregressively generates semantic tokens before image tokens. No equations, derivations, fitted parameters, or mathematical claims are present in the text. The central claim of improved editing preservation (PSNR 26.76 vs 19.92) is presented as an empirical outcome of joint access within one model, without any reduction to self-citations, ansatzes, or renamed known results. The comparison to external pipelines is stated directly as a design difference, not derived from prior author work or by construction. This is a standard non-circular empirical architecture paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review based on abstract only; no model equations, training objectives, or parameter counts are visible, so ledger entries are limited to the core domain assumption stated in the text.

axioms (1)
  • domain assumption Visual semantic tokens (e.g., from SigLIP 2) capture sufficient semantic layout to serve as effective conditioning for full image generation.
    Invoked as the basis for the visual prompt step that reduces modeling difficulty.

pith-pipeline@v0.9.1-grok · 5828 in / 1240 out tokens · 27279 ms · 2026-06-28T07:10:30.171859+00:00 · methodology


Reference graph

Works this paper leans on

50 extracted references · 16 linked inside Pith

  1. Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528, 2024
  2. Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models. arXiv preprint arXiv:2506.15564, 2025
  3. Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024
  4. Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model. arXiv preprint arXiv:2408.11039, 2024
  5. Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025
  6. Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu, Xiaosong Zhang, et al. X-omni: Reinforcement learning makes discrete autoregressive image generative models great again. arXiv preprint arXiv:2507.22058, 2025
  7. Jiuhai Chen, Le Xue, Zhiyang Xu, Xichen Pan, Shusheng Yang, Can Qin, An Yan, Honglu Zhou, Zeyuan Chen, Lifu Huang, et al. Blip3o-next: Next frontier of native image generation. arXiv preprint arXiv:2510.15857, 2025
  8. Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025
  9. Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017
  10. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022
  11. Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022
  12. William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023
  13. Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. Cogview: Mastering text-to-image generation via transformers. Advances in neural information processing systems, 34:19822–19835, 2021
  14. Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2(3):5, 2022
  15. Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024
  16. Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021
  17. Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. Advances in neural information processing systems, 37:84839–84865, 2024
  18. Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. Advances in Neural Information Processing Systems, 37:56424–56445, 2024
  19. Jian Han, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15733–15744, 2025
  20. Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020
  21. Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning, 2024
  22. Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In European Conference on Computer Vision, pages 23–40. Springer, 2024
  23. Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022
  24. Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation. In European Conference on Computer Vision, pages 74–91. Springer, 2024
  25. Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025
  26. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021
  27. Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023
  28. Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022
  29. Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940, 2024
  30. Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392–18402, 2023
  31. Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. Magicbrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Processing Systems, 36:31428–31449, 2023
  32. Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, and Ying Shan. Seed-x: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396, 2024
  33. Jiteng Mu, Nuno Vasconcelos, and Xiaolong Wang. Editar: Unified conditional generation with autoregressive models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 7899–7909, 2025
  34. Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022
  35. Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6038–6047, 2023
  36. Xuan Ju, Ailing Zeng, Yuxuan Bian, Shaoteng Liu, and Qiang Xu. Direct inversion: Boosting diffusion-based editing with 3 lines of code. arXiv preprint arXiv:2310.01506, 2023
  37. Keming Ye, Zhipeng Huang, Canmiao Fu, Qingyang Liu, Jiani Cai, Zheqi Lv, Chen Li, Jing Lyu, Zhou Zhao, and Shengyu Zhang. Unicedit-10m: A dataset and benchmark breaking the scale-quality barrier via unified verification for reasoning-enriched edits. arXiv preprint arXiv:2512.02790, 2025
  38. Yusu Qian, Eli Bocek-Rivele, Liangchen Song, Jialing Tong, Yinfei Yang, Jiasen Lu, Wenze Hu, and Zhe Gan. Pico-banana-400k: A large-scale dataset for text-guided image editing. arXiv preprint arXiv:2510.19808, 2025
  39. Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, et al. Smartedit: Exploring complex instruction-based image editing with multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8362–8371, 2024
  40. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009
  41. James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023
  42. Yuxiang Tuo, Wangmeng Xiang, Jun-Yan He, Yifeng Geng, and Xuansong Xie. Anytext: Multilingual visual text generation and editing. arXiv preprint arXiv:2311.03054, 2023
  43. Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. Textdiffuser-2: Unleashing the power of language models for text rendering. In European Conference on Computer Vision, pages 386–402. Springer, 2024
  44. Chunwei Wang, Guansong Lu, Junwei Yang, Runhui Huang, Jianhua Han, Lu Hou, Wei Zhang, and Hang Xu. Illume: Illuminating your llms to see, draw, and self-enhance. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21612–21622, 2025
  45. Zijie Li, Henry Li, Yichun Shi, Amir Barati Farimani, Yuval Kluger, Linjie Yang, and Peng Wang. Dual diffusion for unified image generation and understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2779–2790, 2025
  46. Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, et al. Transfer between modalities with metaqueries. arXiv preprint arXiv:2504.06256, 2025
  47. Alex Jinpeng Wang, Dongxing Mao, Jiawei Zhang, Weiming Han, Zhuobai Dong, Linjie Li, Yiqi Lin, Zhengyuan Yang, Libo Qin, Fuwei Zhang, et al. Textatlas5m: A large-scale dataset for dense text image generation. arXiv preprint arXiv:2502.07870, 2025
  48. Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36:52132–52152, 2023
  49. Maksim Kuprashevich, Grigorii Alekseenko, Irina Tolstykh, Georgii Fedorov, Bulat Suleimanov, Vladimir Dokholyan, and Aleksandr Gordeev. Nohumansrequired: Autonomous high-quality image editing triplet mining. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6059–6068, 2026
  50. Junying Chen, Zhenyang Cai, Pengcheng Chen, Shunian Chen, Ke Ji, Xidong Wang, Yunjin Yang, and Benyou Wang. Sharegpt-4o-image: Aligning multimodal models with gpt-4o-level image generation. arXiv preprint arXiv:2506.18095, 2025

A Full Class-to-Image Results

Tables 6 and 7 report the complete evaluation metrics at more checkpoints for the class-to-image...

“CE only” indicates that only cross-entropy loss on SigLIP 2 tokens is used (no flow matching loss). “Flow only…