Imagine Before You Draw: Visual Prompt Engineering for Image Generation
Pith reviewed 2026-06-28 07:10 UTC · model grok-4.3
The pith
Integrating visual semantic token generation inside image models avoids external pipeline bottlenecks and improves editing preservation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VPE has the model autoregressively generate visual semantic tokens such as SigLIP 2 as visual prompts that capture semantic layout, then generate full image tokens conditioned on this plan inside the same unified architecture. This internal step enables joint access to input and plan, unlike external two-stage pipelines that feed semantic tokens from one model to an independent decoder.
What carries the argument
Internal autoregressive generation of visual semantic tokens as prompts that guide subsequent image token generation within one unified model.
If this is right
- Accelerates convergence during training across class-conditional, text-to-image, and editing setups.
- Raises quality ceilings for generated images while using the same model scale.
- Delivers higher editing preservation measured at PSNR 26.76 compared with 19.92 for external alternatives.
- Maintains competitive editing responsiveness without extra parameters.
- Applies to various token types and both autoregressive and diffusion-based internal architectures.
Where Pith is reading between the lines
- Single-model systems could handle planning and rendering stages more efficiently by reusing internal representations rather than passing data between separate components.
- Similar internal prompting might extend to video generation where temporal consistency requires joint access to semantic plans and pixel output.
- The method could reduce the parameter overhead of maintaining parallel external planners for downstream editing workflows.
Load-bearing premise
Integrating semantic token generation inside the same model allows the image decoder to jointly access both the original input and the semantic plan.
What would settle it
An external two-stage system reaching PSNR 26 or higher in editing tasks at the same parameter scale would undermine the claimed benefit of internal integration.
Figures
read the original abstract
Incorporating visual semantic representations as an intermediate step before image generation can reduce the modeling difficulty between text and images, thereby improving generation quality. Recent works such as X-Omni and BLIP3o-Next have explored this direction, but they typically use a two-stage external pipeline: a separate autoregressive model first generates semantic tokens, which are then fed as conditioning to an independent diffusion decoder. Since the decoder cannot jointly access the original input and the semantic plan, this design introduces an information bottleneck that limits detail preservation in downstream tasks such as editing. Internal architectures such as Transfusion, BAGEL, and Show-o2 avoid this bottleneck by enabling cross-modal interaction within a single model, but they still face the difficult text-to-pixel modeling gap without intermediate semantic guidance. We propose Visual Prompt Engineering (VPE), which can be seamlessly integrated into such internal frameworks. Specifically, the model first autoregressively generates visual semantic tokens (e.g., SigLIP 2) as "visual prompts" that capture the semantic layout, then generates the full image tokens conditioned on this plan. We validate VPE across class-conditional generation, text-to-image generation, and image editing, covering various token types and model architectures. Results show that VPE can accelerate convergence, raise quality ceilings, and through internal integration, achieve substantially better editing preservation (PSNR: 26.76 vs. 19.92) than external alternatives of the same parameter scale, while maintaining competitive editing responsiveness.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Visual Prompt Engineering (VPE), which integrates autoregressive generation of visual semantic tokens (e.g., SigLIP 2) as internal 'visual prompts' before full image token generation within unified models such as Transfusion or Show-o2. This is positioned as avoiding the information bottleneck of external two-stage pipelines (e.g., X-Omni) while providing semantic guidance absent in pure internal architectures. Validation is claimed across class-conditional generation, text-to-image, and image editing, with a highlighted quantitative result of improved editing preservation (PSNR 26.76 vs. 19.92) at the same parameter scale and competitive responsiveness.
Significance. If the reported PSNR gains and convergence improvements hold under matched experimental conditions, the work would demonstrate a practical internal mechanism for semantic planning that improves detail preservation in editing without external pipelines. The empirical coverage across tasks and token types provides a useful data point for unified multimodal architectures.
major comments (2)
- [Abstract] Abstract: The central claim that 'through internal integration, VPE achieves substantially better editing preservation (PSNR: 26.76 vs. 19.92) than external alternatives of the same parameter scale' is load-bearing for the paper's contribution, yet the manuscript provides no details on whether the external baselines match total parameters (including any extra capacity allocated to VPE's autoregressive semantic head), training compute, data, or optimization procedure. This leaves open the possibility that the delta arises from capacity or end-to-end training differences rather than joint access to input and semantic plan.
- [Abstract] Abstract (and results sections): The assumption that the image decoder jointly accesses the original input and semantic plan to avoid the external bottleneck is presented as the mechanism, but no ablation or controlled comparison isolates this factor from unified training or architectural differences. Without such isolation (e.g., via matched two-stage internal variants), the attribution remains unverified.
minor comments (2)
- [Abstract] The abstract mentions validation 'covering various token types and model architectures' but does not name the specific internal frameworks, tokenizers, or additional metrics used beyond PSNR; adding these would improve reproducibility.
- [Abstract] Presentation of quantitative results would benefit from error bars, number of runs, or baseline implementation details to allow assessment of the reported PSNR values.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below with clarifications and indicate where revisions will be made to improve transparency and rigor.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claim that 'through internal integration, VPE achieves substantially better editing preservation (PSNR: 26.76 vs. 19.92) than external alternatives of the same parameter scale' is load-bearing for the paper's contribution, yet the manuscript provides no details on whether the external baselines match total parameters (including any extra capacity allocated to VPE's autoregressive semantic head), training compute, data, or optimization procedure. This leaves open the possibility that the delta arises from capacity or end-to-end training differences rather than joint access to input and semantic plan.
Authors: We appreciate the referee's emphasis on ensuring fair comparisons. The external baselines such as X-Omni consist of a separate autoregressive semantic model plus an independent decoder, and our reported comparisons use models with matched total parameter counts (VPE's unified architecture allocates capacity to the semantic head internally within the same budget). To eliminate any ambiguity, the revised manuscript will include an explicit table in the appendix detailing parameter counts, training compute, datasets, and optimization settings for VPE and all baselines. revision: yes
-
Referee: [Abstract] Abstract (and results sections): The assumption that the image decoder jointly accesses the original input and semantic plan to avoid the external bottleneck is presented as the mechanism, but no ablation or controlled comparison isolates this factor from unified training or architectural differences. Without such isolation (e.g., via matched two-stage internal variants), the attribution remains unverified.
Authors: In unified models such as Transfusion and Show-o2, the image token generation stage operates within a single transformer that can attend jointly to the original conditioning (text or image) and the autoregressively generated semantic tokens via cross-attention and shared layers. This joint access is architecturally unavailable in external two-stage pipelines. We agree that an explicit ablation against a matched internal two-stage variant would further isolate the effect; however, training such variants requires substantial additional compute. In the revision we will expand the discussion and method sections to provide a more detailed architectural analysis of the joint-access mechanism and its relation to the observed PSNR gains. revision: partial
Circularity Check
No circularity: architectural proposal with empirical validation only
full rationale
The paper proposes Visual Prompt Engineering (VPE) as an internal architectural integration into existing frameworks like Transfusion or Show-o2, where the model autoregressively generates semantic tokens before image tokens. No equations, derivations, fitted parameters, or mathematical claims are present in the text. The central claim of improved editing preservation (PSNR 26.76 vs 19.92) is presented as an empirical outcome of joint access within one model, without any reduction to self-citations, ansatzes, or renamed known results. The comparison to external pipelines is stated directly as a design difference, not derived from prior author work or by construction. This is a standard non-circular empirical architecture paper.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Visual semantic tokens (e.g., from SigLIP 2) capture sufficient semantic layout to serve as effective conditioning for full image generation.
Reference graph
Works this paper leans on
-
[1]
Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single trans- former to unify multimodal understanding and generation.arXiv preprint arXiv:2408.12528, 2024
Pith/arXiv arXiv 2024
-
[2]
Show-o2: Improved native unified multimodal models.arXiv preprint arXiv:2506.15564, 2025
Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models.arXiv preprint arXiv:2506.15564, 2025
Pith/arXiv arXiv 2025
-
[3]
Emu3: Next-token prediction is all you need
Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024
Pith/arXiv arXiv 2024
-
[4]
Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model.arXiv preprint arXiv:2408.11039, 2024
Pith/arXiv arXiv 2024
-
[5]
Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025
Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025
Pith/arXiv arXiv 2025
-
[6]
Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu, Xiaosong Zhang, et al. X-omni: Reinforcement learning makes discrete autoregressive image generative models great again.arXiv preprint arXiv:2507.22058, 2025
arXiv 2025
-
[7]
Blip3o-next: Next frontier of native image generation
Jiuhai Chen, Le Xue, Zhiyang Xu, Xichen Pan, Shusheng Yang, Can Qin, An Yan, Honglu Zhou, Zeyuan Chen, Lifu Huang, et al. Blip3o-next: Next frontier of native image generation. arXiv preprint arXiv:2510.15857, 2025
arXiv 2025
-
[8]
Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alab- dulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025
Pith/arXiv arXiv 2025
-
[9]
Neural discrete representation learning.Advances in neural information processing systems, 30, 2017
Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning.Advances in neural information processing systems, 30, 2017
2017
-
[10]
Chain-of-thought prompting elicits reasoning in large language models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022
2022
-
[11]
Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022
Pith/arXiv arXiv 2022
-
[12]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023
2023
-
[13]
Cogview: Mastering text-to-image generation via transformers.Advances in neural information processing systems, 34:19822–19835, 2021
Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. Cogview: Mastering text-to-image generation via transformers.Advances in neural information processing systems, 34:19822–19835, 2021
2021
-
[14]
Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation.arXiv preprint arXiv:2206.10789, 2(3):5, 2022
Pith/arXiv arXiv 2022
-
[15]
Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation.arXiv preprint arXiv:2406.06525, 2024
Pith/arXiv arXiv 2024
-
[16]
Taming transformers for high-resolution image synthesis
Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021. 10
2021
-
[17]
Visual autoregressive modeling: Scalable image generation via next-scale prediction.Advances in neural information processing systems, 37:84839–84865, 2024
Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction.Advances in neural information processing systems, 37:84839–84865, 2024
2024
-
[18]
Autoregressive image generation without vector quantization.Advances in Neural Information Processing Systems, 37:56424–56445, 2024
Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization.Advances in Neural Information Processing Systems, 37:56424–56445, 2024
2024
-
[19]
Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis
Jian Han, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15733–15744, 2025
2025
-
[20]
Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020
2020
-
[21]
Scaling rectified flow trans- formers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow trans- formers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024
2024
-
[22]
Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers
Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. InEuropean Conference on Computer Vision, pages 23–40. Springer, 2024
2024
-
[23]
Photorealistic text-to-image diffusion models with deep language understanding.Advances in neural information processing systems, 35:36479–36494, 2022
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding.Advances in neural information processing systems, 35:36479–36494, 2022
2022
-
[24]
Pixart- σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation
Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart- σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation. InEuropean Conference on Computer Vision, pages 74–91. Springer, 2024
2024
-
[25]
Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025
Pith/arXiv arXiv 2025
-
[26]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021
2021
-
[27]
Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023
Pith/arXiv arXiv 2023
-
[28]
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents.arXiv preprint arXiv:2204.06125, 1(2):3, 2022
Pith/arXiv arXiv 2022
-
[29]
Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think.arXiv preprint arXiv:2410.06940, 2024
Pith/arXiv arXiv 2024
-
[30]
Instructpix2pix: Learning to follow image editing instructions
Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392–18402, 2023
2023
-
[31]
Magicbrush: A manually annotated dataset for instruction-guided image editing.Advances in Neural Information Processing Systems, 36:31428–31449, 2023
Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. Magicbrush: A manually annotated dataset for instruction-guided image editing.Advances in Neural Information Processing Systems, 36:31428–31449, 2023. 11
2023
-
[32]
Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, and Ying Shan. Seed-x: Multimodal models with unified multi-granularity comprehension and generation.arXiv preprint arXiv:2404.14396, 2024
Pith/arXiv arXiv 2024
-
[33]
Editar: Unified conditional generation with autoregressive models
Jiteng Mu, Nuno Vasconcelos, and Xiaolong Wang. Editar: Unified conditional generation with autoregressive models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 7899–7909, 2025
2025
-
[34]
Prompt-to-prompt image editing with cross attention control.arXiv preprint arXiv:2208.01626, 2022
Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control.arXiv preprint arXiv:2208.01626, 2022
Pith/arXiv arXiv 2022
-
[35]
Null-text inversion for editing real images using guided diffusion models
Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6038–6047, 2023
2023
-
[36]
Xuan Ju, Ailing Zeng, Yuxuan Bian, Shaoteng Liu, and Qiang Xu. Direct inversion: Boosting diffusion-based editing with 3 lines of code.arXiv preprint arXiv:2310.01506, 2023
arXiv 2023
-
[37]
Keming Ye, Zhipeng Huang, Canmiao Fu, Qingyang Liu, Jiani Cai, Zheqi Lv, Chen Li, Jing Lyu, Zhou Zhao, and Shengyu Zhang. Unicedit-10m: A dataset and benchmark breaking the scale-quality barrier via unified verification for reasoning-enriched edits.arXiv preprint arXiv:2512.02790, 2025
arXiv 2025
-
[38]
Yusu Qian, Eli Bocek-Rivele, Liangchen Song, Jialing Tong, Yinfei Yang, Jiasen Lu, Wenze Hu, and Zhe Gan. Pico-banana-400k: A large-scale dataset for text-guided image editing.arXiv preprint arXiv:2510.19808, 2025
arXiv 2025
-
[39]
Smartedit: Exploring complex instruction- based image editing with multimodal large language models
Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, et al. Smartedit: Exploring complex instruction- based image editing with multimodal large language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8362–8371, 2024
2024
-
[40]
Imagenet: A large- scale hierarchical image database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large- scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009
2009
-
[41]
Improving image generation with better captions
James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn. openai. com/papers/dall-e-3. pdf, 2(3):8, 2023
2023
-
[42]
Anytext: Multilingual visual text generation and editing.arXiv preprint arXiv:2311.03054, 2023
Yuxiang Tuo, Wangmeng Xiang, Jun-Yan He, Yifeng Geng, and Xuansong Xie. Anytext: Multilingual visual text generation and editing.arXiv preprint arXiv:2311.03054, 2023
arXiv 2023
-
[43]
Textdiffuser-2: Unleashing the power of language models for text rendering
Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. Textdiffuser-2: Unleashing the power of language models for text rendering. InEuropean Conference on Computer Vision, pages 386–402. Springer, 2024
2024
-
[44]
Illume: Illuminating your llms to see, draw, and self-enhance
Chunwei Wang, Guansong Lu, Junwei Yang, Runhui Huang, Jianhua Han, Lu Hou, Wei Zhang, and Hang Xu. Illume: Illuminating your llms to see, draw, and self-enhance. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 21612–21622, 2025
2025
-
[45]
Dual diffusion for unified image generation and understanding
Zijie Li, Henry Li, Yichun Shi, Amir Barati Farimani, Yuval Kluger, Linjie Yang, and Peng Wang. Dual diffusion for unified image generation and understanding. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 2779–2790, 2025
2025
-
[46]
Transfer between modalities with metaqueries.arXiv preprint arXiv:2504.06256, 2025
Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, et al. Transfer between modalities with metaqueries.arXiv preprint arXiv:2504.06256, 2025
Pith/arXiv arXiv 2025
-
[47]
Alex Jinpeng Wang, Dongxing Mao, Jiawei Zhang, Weiming Han, Zhuobai Dong, Linjie Li, Yiqi Lin, Zhengyuan Yang, Libo Qin, Fuwei Zhang, et al. Textatlas5m: A large-scale dataset for dense text image generation.arXiv preprint arXiv:2502.07870, 2025. 12
arXiv 2025
-
[48]
Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023
Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023
2023
-
[49]
Nohumansrequired: Autonomous high-quality image editing triplet mining
Maksim Kuprashevich, Grigorii Alekseenko, Irina Tolstykh, Georgii Fedorov, Bulat Suleimanov, Vladimir Dokholyan, and Aleksandr Gordeev. Nohumansrequired: Autonomous high-quality image editing triplet mining. InProceedings of the IEEE/CVF Winter Conference on Applica- tions of Computer Vision, pages 6059–6068, 2026
2026
-
[50]
Junying Chen, Zhenyang Cai, Pengcheng Chen, Shunian Chen, Ke Ji, Xidong Wang, Yunjin Yang, and Benyou Wang. Sharegpt-4o-image: Aligning multimodal models with gpt-4o-level image generation.arXiv preprint arXiv:2506.18095, 2025. 13 A Full Class-to-Image Results Tables 6 and 7 report the complete evaluation metrics at more checkpoints for the class-to-image...
arXiv 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.