InterleaveThinker: Reinforcing Agentic Interleaved Generation
Pith reviewed 2026-06-27 06:40 UTC · model grok-4.3
The pith
A planner-critic pipeline adds interleaved text-image generation to any existing image generator.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
InterleaveThinker is the first multi-agent pipeline that endows any image generator with interleaved generation capabilities by employing a planner agent to organize image-text input sequences and a critic agent to evaluate generator outputs, identify deviations, and refine instructions for regeneration. The system builds Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k for format cold-start, then uses Interleave-Critic-RL-13k with GRPO to reinforce step-wise instruction correction. Because full-trajectory optimization over 25+ generator calls is impractical, accuracy reward and step-wise reward are proposed so that single-step RL can guide the entire trajectory, yielding performanc
What carries the argument
The planner-critic multi-agent pipeline trained first with SFT then reinforced by single-step RL using accuracy and step-wise rewards.
If this is right
- Performance improves across various image generators on interleaved tasks.
- Results reach levels comparable to Nano Banana and GPT-5 on interleaved generation benchmarks.
- Base models show substantial gains on reasoning benchmarks such as WISE and RISE when using 4-step FLUX.2-klein.
Where Pith is reading between the lines
- The same planner-critic structure could be tested on sequential generation tasks beyond images, such as video frame sequences.
- Single-step RL with these rewards may lower the compute barrier for applying agentic oversight to even longer generation chains.
- Agentic correction layers might serve as a general way to add capabilities missing from base generator architectures without retraining them.
Load-bearing premise
The accuracy and step-wise rewards used in single-step RL will steer the full multi-step interleaved trajectory without the critic introducing compounding errors over 25 or more generator calls.
What would settle it
Measure output quality on an interleaved generation benchmark that requires trajectories longer than 25 generator calls; if quality shows no gain or degrades relative to the base generator, the central claim does not hold.
Figures
read the original abstract
Recent image generators have demonstrated impressive photorealism and instruction-following capabilities in single-image generation and editing. However, constrained by their architectures, they cannot achieve interleaved generation (text-image sequence), which has crucial applications in visual narratives, guidance, and embodied manipulation. Even the latest open-source Unified Multimodal Models (UMMs) exhibit limited performance in this regard. In this paper, we introduce InterleaveThinker, the first multi-agent pipeline designed to endow any existing image generator with interleaved generation capabilities. Specifically, we employ a planner agent to organize the image-text input sequence, instructing the image generator on the required execution at each step. Subsequently, we introduce a critic agent to evaluate the generator's outputs, identify samples that deviate from the planned instructions, and refine the instructions for regeneration. To implement this pipeline, we construct the Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k to perform a format cold-start. Then we develop Interleave-Critic-RL-13k to reinforce the step-wise instruction correction capability within a generation trajectory using GRPO. Since a single interleaved generation trajectory may involve over 25 generator calls, optimizing the entire trajectory is computationally impractical. Therefore, we propose accuracy reward and step-wise reward, allowing single-step RL to effectively guide the entire generation trajectory. The results show that InterleaveThinker improves performance across various image generators. On interleaved generation benchmarks, it achieves performance comparable to Nano Banana and GPT-5. Surprisingly, it also significantly enhances the base model on reasoning-based benchmarks; for example, on 4-step FLUX.2-klein, we observe substantial gains on WISE and RISE.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces InterleaveThinker, a multi-agent pipeline consisting of a planner agent that organizes text-image sequences for any base image generator and a critic agent that evaluates outputs, detects deviations, and refines instructions for regeneration. It constructs Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k for format cold-start SFT, then Interleave-Critic-RL-13k for GRPO-based reinforcement of the critic using proposed accuracy and step-wise rewards. The design addresses the impracticality of full-trajectory optimization for sequences with over 25 generator calls, claiming performance gains across generators, comparability to Nano Banana and GPT-5 on interleaved benchmarks, and substantial improvements on WISE and RISE for 4-step FLUX.2-klein.
Significance. If the results hold under rigorous verification, the work would be significant for demonstrating a practical agentic method to retrofit interleaved generation onto existing image generators without retraining their weights. The explicit construction of large-scale SFT datasets and the reward formulation to enable single-step RL approximation of multi-step trajectories are concrete contributions that could be adopted or extended in vision-language agent research.
major comments (1)
- [Abstract] Abstract: The central claim that the accuracy reward and step-wise reward 'allow single-step RL to effectively guide the entire generation trajectory' is load-bearing for the reported benchmark comparability and WISE/RISE gains, yet no ablation, error accumulation analysis, or stability check over 25+ steps is referenced to confirm that critic corrections do not introduce compounding deviations; this assumption directly determines whether the single-step GRPO design supports the full pipeline results.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract's central claim. We address the concern point by point below and commit to strengthening the manuscript accordingly.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claim that the accuracy reward and step-wise reward 'allow single-step RL to effectively guide the entire generation trajectory' is load-bearing for the reported benchmark comparability and WISE/RISE gains, yet no ablation, error accumulation analysis, or stability check over 25+ steps is referenced to confirm that critic corrections do not introduce compounding deviations; this assumption directly determines whether the single-step GRPO design supports the full pipeline results.
Authors: We agree that the absence of explicit ablations on error accumulation and long-horizon stability leaves the claim under-supported in the current manuscript. The step-wise reward is formulated to deliver immediate per-step feedback that corrects deviations before propagation, while the accuracy reward anchors overall trajectory fidelity; the observed gains on WISE/RISE for 4-step FLUX.2-klein and parity with Nano Banana/GPT-5 on interleaved benchmarks provide indirect empirical corroboration. Nevertheless, to directly validate that critic interventions do not compound over 25+ generator calls, we will add a dedicated analysis section (including ablation of the step-wise reward, per-step deviation tracking, and stability metrics on extended trajectories) in the revised manuscript. revision: yes
Circularity Check
No significant circularity; derivation relies on external benchmarks and standard RL techniques
full rationale
The paper presents a multi-agent pipeline (planner + critic) trained via SFT on constructed datasets followed by GRPO RL with accuracy and step-wise rewards. No equations, fitted parameters renamed as predictions, or self-citation chains reduce any central claim to its own inputs by construction. Results are evaluated against external benchmarks (WISE, RISE) and compared to independent models (Nano Banana, GPT-5). The design choice to use single-step RL for multi-step trajectories is an explicit engineering assumption, not a self-referential derivation.
Axiom & Free-Parameter Ledger
free parameters (2)
- accuracy reward and step-wise reward formulation
- dataset sizes and construction rules for Interleave-Planner-SFT-80k etc.
axioms (1)
- domain assumption The critic agent can reliably detect and correct deviations from the planner's instructions in complex interleaved sequences
Reference graph
Works this paper leans on
-
[1]
Flux.https://github.com/black-forest-labs/flux, 2024
Black Forest Labs. Flux.https://github.com/black-forest-labs/flux, 2024
2024
-
[2]
FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/flux-2, 2025
Black Forest Labs. FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/flux-2, 2025
2025
-
[3]
Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025
Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025
Pith/arXiv arXiv 2025
-
[4]
Longcat-image technical report.arXiv preprint arXiv:2512.07584, 2025
Meituan LongCat Team, Hanghang Ma, Haoxian Tan, Jiale Huang, Junqiang Wu, Jun-Yan He, Lishuai Gao, Songlin Xiao, Xiaoming Wei, Xiaoqi Ma, et al. Longcat-image technical report.arXiv preprint arXiv:2512.07584, 2025
Pith/arXiv arXiv 2025
-
[5]
Scaling rectified flow transformers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. InICML, 2024
2024
-
[6]
Ovis-u1 technical report.arXiv preprint arXiv:2506.23044, 2025
Guo-Hua Wang, Shanshan Zhao, Xinjie Zhang, Liangfu Cao, Pengxin Zhan, Lunhao Duan, Shiyin Lu, Minghao Fu, Xiaohao Chen, Jianshan Zhao, et al. Ovis-u1 technical report.arXiv preprint arXiv:2506.23044, 2025
arXiv 2025
-
[7]
Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025
Pith/arXiv arXiv 2025
-
[8]
Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, et al. Emu3. 5: Native multimodal models are world learners.arXiv preprint arXiv:2510.26583, 2025
Pith/arXiv arXiv 2025
-
[9]
Nano banana
Google. Nano banana. 2025
2025
-
[10]
Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025
Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025
Pith/arXiv arXiv 2025
-
[11]
Omnigen2: Exploration to advanced multimodal generation.arXiv preprint arXiv:2506.18871, 2025
Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, et al. Omnigen2: Exploration to advanced multimodal generation.arXiv preprint arXiv:2506.18871, 2025
Pith/arXiv arXiv 2025
-
[12]
Hunyuanimage 3.0 technical report.arXiv preprint arXiv:2509.23951, 2025
Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, et al. Hunyuanimage 3.0 technical report.arXiv preprint arXiv:2509.23951, 2025
Pith/arXiv arXiv 2025
-
[13]
Dian Zheng, Manyuan Zhang, Hongyu Li, Kai Zou, Hongbo Liu, Ziyu Guo, Kaituo Feng, Yexin Liu, Ying Luo, Yan Feng, et al. Architecture decoupling is not all you need for unified multimodal model.arXiv preprint arXiv:2511.22663, 2025
Pith/arXiv arXiv 2025
-
[14]
Denoising diffusion probabilistic models
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. InNeurIPS, 2020
2020
-
[15]
Auto-encoding variational bayes.arXiv preprint arXiv:1312.6114, 2013
Diederik P Kingma and Max Welling. Auto-encoding variational bayes.arXiv preprint arXiv:1312.6114, 2013
Pith/arXiv arXiv 2013
-
[16]
Adding conditional control to text-to-image diffusion models
Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. InICCV, 2023
2023
-
[17]
Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report.ar...
Pith/arXiv arXiv 2025
-
[18]
Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025
Pith/arXiv arXiv 2025
-
[19]
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023
Pith/arXiv arXiv 2023
-
[20]
Gen-searcher: Reinforcing agentic search for image generation.arXiv preprint arXiv:2603.28767, 2026
Kaituo Feng, Manyuan Zhang, Shuang Chen, Yunlong Lin, Kaixuan Fan, Yilei Jiang, Hongyu Li, Dian Zheng, Chenyang Wang, and Xiangyu Yue. Gen-searcher: Reinforcing agentic search for image generation.arXiv preprint arXiv:2603.28767, 2026
Pith/arXiv arXiv 2026
-
[21]
Step1x-edit: A practical framework for general image editing.arXiv preprint arXiv:2504.17761, 2025
Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. Step1x-edit: A practical framework for general image editing.arXiv preprint arXiv:2504.17761, 2025
Pith/arXiv arXiv 2025
-
[22]
Glm-image.https://huggingface.co/zai-org/GLM-Image, 2026
Zhipu AI. Glm-image.https://huggingface.co/zai-org/GLM-Image, 2026
2026
-
[23]
Hongyu Li, Manyuan Zhang, Dian Zheng, Ziyu Guo, Yimeng Jia, Kaituo Feng, Hao Yu, Yexin Liu, Yan Feng, Peng Pei, et al. Editthinker: Unlocking iterative reasoning for any image editor.arXiv preprint arXiv:2512.05965, 2025. 12 InterleaveThinker: Reinforcing Agentic Interleaved Generation
arXiv 2025
-
[24]
Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Zhaohui Hou, Shijie Huang, Dengyang Jiang, Xin Jin, Liangchen Li, et al. Z-image: An efficient image generation foundation model with single-stream diffusion transformer.arXiv preprint arXiv:2511.22699, 2025
Pith/arXiv arXiv 2025
-
[25]
Nano-banana-pro
Google. Nano-banana-pro. Accessed November, 2025 [Online] https://deepmind.google/models/ gemini-image/pro/, 2025
2025
-
[26]
Longcat-next: Lexicalizing modalities as discrete tokens.arXiv preprint arXiv:2603.27538, 2026
Meituan LongCat Team, Bin Xiao, Chao Wang, Chengjiang Li, Chi Zhang, Chong Peng, Hang Yu, Hao Yang, Haonan Yan, Haoze Sun, et al. Longcat-next: Lexicalizing modalities as discrete tokens.arXiv preprint arXiv:2603.27538, 2026
arXiv 2026
-
[27]
Emu3: Next-token prediction is all you need.arXiv preprint arXiv:2409.18869, 2024
Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need.arXiv preprint arXiv:2409.18869, 2024
Pith/arXiv arXiv 2024
-
[28]
Dian Zheng, Manyuan Zhang, Hongyu Li, Hongbo Liu, Kai Zou, Kaituo Feng, and Hongsheng Li. Uni-edit: Intelligent editing is a general task for unified model tuning.arXiv preprint arXiv:2605.21487, 2026
Pith/arXiv arXiv 2026
-
[29]
Duogen: Towards general purpose interleaved multimodal generation
Min Shi, Xiaohui Zeng, Jiannan Huang, Yin Cui, Francesco Ferroni, Jialuo Li, Shubham Pachori, Zhaoshuo Li, Yogesh Balaji, Haoxiang Wang, Tsung-Yi Lin, Xiao Fu, Yue Zhao, Chieh-Yun Chen, Ming-Yu Liu, and Humphrey Shi. Duogen: Towards general purpose interleaved multimodal generation. InCVPR, 2026
2026
-
[30]
Agentic reinforced policy optimization.arXiv preprint arXiv:2507.19849, 2025
Guanting Dong, Hangyu Mao, Kai Ma, Licheng Bao, Yifei Chen, Zhongyuan Wang, Zhongxia Chen, Jiazhen Du, Huiyang Wang, Fuzheng Zhang, et al. Agentic reinforced policy optimization.arXiv preprint arXiv:2507.19849, 2025
Pith/arXiv arXiv 2025
-
[31]
Wenxuan Huang, Yu Zeng, Qiuchen Wang, Zhen Fang, Shaosheng Cao, Zheng Chu, Qingyu Yin, Shuang Chen, Zhenfei Yin, Lin Chen, et al. Vision-deepresearch: Incentivizing deepresearch capability in multimodal large language models.arXiv preprint arXiv:2601.22060, 2026
arXiv 2026
-
[32]
Yuhao Dong, Zuyan Liu, Shulin Tian, Yongming Rao, and Ziwei Liu. Insight-v++: Towards advanced long-chain visual reasoning with multimodal large language models.arXiv preprint arXiv:2603.18118, 2026
arXiv 2026
-
[33]
Self-refine: Iterative refinement with self-feedback
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. InNeurIPS, 2023
2023
-
[34]
Reflexion: Language agents with verbal reinforcement learning
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. InNeurIPS, 2023
2023
-
[35]
Genartist: Multimodal llm as an agent for unified image generation and editing
Zhenyu Wang, Aoxue Li, Zhenguo Li, and Xihui Liu. Genartist: Multimodal llm as an agent for unified image generation and editing. InNeurIPS, 2024
2024
-
[36]
Idea2img: Iterative self-refinement with gpt-4v for automatic image design and generation
Zhengyuan Yang, Jianfeng Wang, Linjie Li, Kevin Lin, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. Idea2img: Iterative self-refinement with gpt-4v for automatic image design and generation. InECCV, 2024
2024
-
[37]
Reflect-dit: Inference-time scaling for text-to-image diffusion transformers via in-context reflection
Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Arsh Koneru, Yusuke Kato, Kazuki Kozuka, and Aditya Grover. Reflect-dit: Inference-time scaling for text-to-image diffusion transformers via in-context reflection. InICCV, 2025
2025
-
[38]
From reflection to perfection: Scaling inference-time optimization for text-to-image diffusion models via reflection tuning
Le Zhuo, Liangbing Zhao, Sayak Paul, Yue Liao, Renrui Zhang, Yi Xin, Peng Gao, Mohamed Elhoseiny, and Hongsheng Li. From reflection to perfection: Scaling inference-time optimization for text-to-image diffusion models via reflection tuning. In ICCV, 2025
2025
-
[39]
Reasonedit: Towards reasoning-enhanced image editing models.arXiv preprint arXiv:2511.22625, 2025
Fukun Yin, Shiyu Liu, Yucheng Han, Zhibo Wang, Peng Xing, Rui Wang, Wei Cheng, Yingming Wang, Aojie Li, Zixin Yin, et al. Reasonedit: Towards reasoning-enhanced image editing models.arXiv preprint arXiv:2511.22625, 2025
arXiv 2025
-
[40]
Hengjia Li, Liming Jiang, Qing Yan, Yizhi Song, Hao Kang, Zichuan Liu, Xin Lu, Boxi Wu, and Deng Cai. Thinkrl-edit: Thinking in reinforcement learning for reasoning-centric image editing.arXiv preprint arXiv:2601.03467, 2026
arXiv 2026
-
[41]
Gemini 2.5 pro.https://deepmind.google/models/gemini/pro/, 2025
Google DeepMind. Gemini 2.5 pro.https://deepmind.google/models/gemini/pro/, 2025
2025
-
[42]
Viescore: Towards explainable metrics for conditional image synthesis evaluation
Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, and Wenhu Chen. Viescore: Towards explainable metrics for conditional image synthesis evaluation. InACL, 2024
2024
-
[43]
Comm: A coherent interleaved image-text dataset for multimodal understanding and generation
Wei Chen, Lin Li, Yongqi Yang, Bin Wen, Fan Yang, Tingting Gao, Yu Wu, and Long Chen. Comm: A coherent interleaved image-text dataset for multimodal understanding and generation. InCVPR, 2025
2025
-
[44]
Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025
Pith/arXiv arXiv 2025
-
[45]
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025
Pith/arXiv arXiv 2025
-
[46]
Ueval: A benchmark for unified multimodal generation.arXiv preprint arXiv:2601.22155, 2026
Bo Li, Yida Yin, Wenhao Chai, Xingyu Fu, and Zhuang Liu. Ueval: A benchmark for unified multimodal generation.arXiv preprint arXiv:2601.22155, 2026
arXiv 2026
-
[47]
Experiment with gemini 2.0 flash native image generation, march 2025.URL https://developers
Kat Kampf and Nicole Brichtova. Experiment with gemini 2.0 flash native image generation, march 2025.URL https://developers. googleblog. com/en/experiment-with-gemini-20-flash-native-image-generation/. Accessed, 2025
2025
-
[48]
Show-o2: Improved native unified multimodal models
Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models. InNeurIPS, 2025. 13 InterleaveThinker: Reinforcing Agentic Interleaved Generation
2025
-
[49]
Mmada: Multimodal large diffusion language models.arXiv preprint arXiv:2505.15809, 2025
Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. Mmada: Multimodal large diffusion language models.arXiv preprint arXiv:2505.15809, 2025
Pith/arXiv arXiv 2025
-
[50]
Yuwei Niu, Munan Ning, Mengren Zheng, Weiyang Jin, Bin Lin, Peng Jin, Jiaqi Liao, Chaoran Feng, Kunpeng Ning, Bin Zhu, et al. Wise: A world knowledge-informed semantic evaluation for text-to-image generation.arXiv preprint arXiv:2503.07265, 2025
Pith/arXiv arXiv 2025
-
[51]
Xiangyu Zhao, Peiyuan Zhang, Kexian Tang, Xiaorong Zhu, Hao Li, Wenhao Chai, Zicheng Zhang, Renqiu Xia, Guangtao Zhai, Junchi Yan, et al. Envisioning beyond the pixels: Benchmarking reasoning-informed visual editing.arXiv preprint arXiv:2504.02826, 2025
arXiv 2025
-
[52]
Minigpt-5: Interleaved vision-and-language generation via generative vokens
Kaizhi Zheng, Xuehai He, and Xin Eric Wang. Minigpt-5: Interleaved vision-and-language generation via generative vokens. arXiv preprint arXiv:2310.02239, 2023
arXiv 2023
-
[53]
Making llama see and draw with seed tokenizer.arXiv preprint arXiv:2310.01218, 2023
Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang, and Ying Shan. Making llama see and draw with seed tokenizer.arXiv preprint arXiv:2310.01218, 2023
arXiv 2023
-
[54]
Generative multimodal models are in-context learners
Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative multimodal models are in-context learners. InCVPR, 2024
2024
-
[55]
Gpt-image-1
OpenAI. Gpt-image-1. 2025
2025
-
[56]
Stable diffusion 3.5 large
Stability AI. Stable diffusion 3.5 large. https://huggingface.co/stabilityai/stable-diffusion-3. 5-large, 2024
2024
-
[57]
Seedream 4.0
ByteDance. Seedream 4.0. 2025
2025
-
[58]
{text_input}
Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, et al. Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space.arXiv e-prints, 2025. 14 InterleaveThinker: Reinforcing Agentic Interleaved Generation A System Prompt Pl...
2025
-
[59]
Every step in your execution plan MUST represent an actual image generation or image editing action
**Dynamic Step Count (Image Operations Only)**: Determine the necessary number of steps. Every step in your execution plan MUST represent an actual image generation or image editing action. **DO NOT** create separate steps solely for generating text, captions, or summaries
-
[60]
For visual or creative tasks, the final step MUST result in a fully colored, detailed, and polished output
**Complete & Polished Output**: Always aim for a fully realized final product. For visual or creative tasks, the final step MUST result in a fully colored, detailed, and polished output. Do not stop at a draft, outline, or uncolored sketch unless the user explicitly requests it
-
[61]
**Text Generation & Auxiliary Text Rule**: - If the user specifically asks to render or draw text *inside* the image, include this requirement within the `instruction`field. - If the user explicitly asks for a *separate* text response (e.g., a caption, summary, explanation, or knowledge grounding) to accompany the image, generate this text and place it in...
-
[62]
add a red hat
**Prompt Optimization for All Steps**: Convert the`instruction`of EVERY step into a highly effective prompt in the `prompt`field. - **Step 1 (Generation)**: Create a highly detailed T2I prompt representing the foundational stage. Focus *only* on the Step 1 instruction. Do NOT hallucinate unmentioned details or future elements. - **Subsequent Steps (Editin...
-
[63]
Step 1:",
**CRITICAL**: The`prompt`field MUST contain ONLY the pure text prompt or editing instruction. DO NOT include meta-text, prefixes (such as "Step 1:", "Prompt:", "Edit:"), or conversational filler. It must be directly usable by the generation/editing API. ## Output The output consists of two parts:
-
[65]
The optimized, pure T2I prompt suitable for the image generation model. (No 'Step 1:' prefix)
A JSON -- Planing each step and rewrite the instruction to prompt suitable for generation/editing. Here is a output example <think> Part 1: Planning analysis explaining the execution plan. Part 2: Analysis of how the instructions were translated into visual keywords for the T2I prompt and editing instructions. </think> <answer> { 'execution_plan': [ {'ste...
-
[66]
**Task Identification & Modality Routing**: Carefully analyze the input to determine the task type. - **Task A (General Text Response / Problem Solving / Image-to-Text)**: If the user provides a complete sequence of images and asks for text responses for each step (e.g., describing the images, solving a problem, explaining a process, or answering question...
-
[67]
(3)", "Step 3:
**Strict Step Count & NO Prefix Rule**: - **Step Count**: Determine the logical number of steps. **CRITICAL**: If the user's input explicitly specifies the number of steps required, you MUST strictly output exactly that number of steps to fulfill the requirement. If continuing a sequence (Task B), your`step_number`MUST start exactly from where the user's ...
-
[68]
You MUST set this to`null` for Task A
**Field Definitions & Usage**: -`instruction`: The detailed, pure text content or action for the editing step (Task B). You MUST set this to`null` for Task A. (Strictly NO step prefixes). -`prompt`: The optimized, pure instruction suitable for the **image editing model** to execute the change based on the previous image (Task B). You MUST set this to`null...
-
[69]
## Output The output consists of two parts:
**Complete Output**: Ensure the final step achieves a complete resolution of the user's goal based on the sequence context. ## Output The output consists of two parts:
-
[70]
A Statement - Just an dummy reasoning
-
[71]
Detailed instruction for this step (Task B). Output null if this is Task A. Strictly NO prefixes like 'Step i:' or '(i)'
A JSON -- Planing each step and rewrite the instruction to prompt suitable for generation/editing. Here is a output example <think> </think> <answer> { 'execution_plan': [ {'step_number': i, 'step_name': 'Short name for the step', 'instruction': "Detailed instruction for this step (Task B). Output null if this is Task A. Strictly NO prefixes like 'Step i:...
-
[72]
Evaluate the edited image and output the result in boolean format (True/False)
-
[73]
{original_instruction}
If you think the edited image is not good enough (False), generate an optimized rewritten prompt that addresses the original shortcomings; if you think it is good enough (True), output the [Original Rewritten Prompt]. ## Input Information You have been presented with two images in sequence: - Original Image: The input image before editing. (NOTE: For the ...
-
[74]
Otherwise, observe the delta (differences)
**Criterion A (Intent Matching)**: If the Before Image is pure white, evaluate if the After Image successfully generated the Previous Step from scratch. Otherwise, observe the delta (differences). Did the changes match the key meaning and necessary details of the Previous Step?
-
[75]
Fault Finder
**Criterion B (Anomaly & Logic Detection - CRITICAL)**: You must actively play the role of a "Fault Finder". Do NOT just check if the requested object exists; you MUST check HOW it exists. Scan the After Image for any of the following fatal errors: - **Anatomical/Biological Errors**: Extra/missing limbs or fingers, body parts emerging from impossible or a...
-
[76]
*(If Rewritten Prompt is empty, directly compare Original Instruction→Result)
**What went wrong?** - Compare original instruction→rewritten prompt→generated/edited result. *(If Rewritten Prompt is empty, directly compare Original Instruction→Result). * - Identify gaps between intent and execution - Determine if the issue is clarity, specificity, or contradiction
-
[77]
maintain [aspect]
**Refinement Approaches:** **If this is an Initial Generation task (Before image was blank):** - **Establish Foundation:** Translate the raw user instruction into a comprehensive Text-to-Image prompt. - **Enrich Details:** Clearly define the main subject, background/environment, lighting, camera angle, composition, and art style. - **Prevent Ambiguity:** ...
-
[78]
**Leverage All Information:** - Reference what's visible in the original image - Learn from what the previous rewritten prompt missed - Use the edited image as feedback on what went wrong - Maintain what worked, fix what didn't ## Output The output consists of three parts:
-
[79]
A Statement - Analysis process and reasoning
-
[80]
A Boolean - Judge whether the edited images is good enough
-
[81]
Fault Finder
A prompt -- either the optimized rewritten prompt or the original rewritten prompt. Here is a output example: <think> Detailed explanation of evaluation and new rewritten prompt. If edited image is good enough, explain why it meets requirements. If not good enough, explain specific shortcomings. </think> <answer> { 'previous_step_success': 'boolean (True ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.