
arxiv: 2606.01985 · v1 · pith:XKJKW3DCnew · submitted 2026-06-01 · 💻 cs.CV

MT-EditFlow: Reinforcement Learning for Multi-Turn Image Editing with Flow Matching

Pith reviewed 2026-06-28 15:05 UTC · model grok-4.3

classification 💻 cs.CV
keywords multi-turn image editing · flow matching · reinforcement learning · reward aggregation · exposure bias · image editing · GRPO · NFT

The pith

MT-EditFlow uses flow-matching reinforcement learning to optimize multi-turn image editing rewards by broadcasting aggregated advantages across trajectories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Single-turn image editing models fail in multi-turn use because one bad edit ruins the sequence and errors compound through exposure bias. The paper introduces MT-EditFlow, a flow-matching RL method that combines multi-turn planning with multi-reward signals usable in GRPO and NFT training. Systematic tests of reward aggregation, VLM scoring modes, and advantage fusion show that broadcasting the aggregated advantage over the full trajectory links local edits to overall success. This produces measurable gains, including a 6.85-point lift on FLUX.1-Kontext-dev at turn 3 while keeping high per-turn success rates.
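
To make the all-or-nothing requirement concrete, here is a small worked calculation. It assumes, purely for illustration, that each turn $k$ succeeds independently with marginal rate $p_k$ and that all turns share the same rate $p$; neither the independence assumption nor the numbers come from the paper.

```latex
P(\text{all } K \text{ turns succeed}) \;=\; \prod_{k=1}^{K} p_k \;=\; p^{K},
\qquad
p = 0.8,\; K = 3 \;\Rightarrow\; 0.8^{3} = 0.512 .
```

Even a fairly reliable single-turn editor therefore completes only about half of three-turn sequences under these assumptions. Exposure bias would push the later $p_k$ lower still, because each turn edits the model's own previous output.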

Core claim

MT-EditFlow is a flow-matching reinforcement learning framework that optimizes reward signals for sequential image editing through a multi-turn perspective and multi-reward formulation applicable to both GRPO and NFT-based methods. The central discovery is that broadcasting the aggregated advantage across the entire editing trajectory bridges local planning and global multi-turn task success, reducing exposure bias without reward hacking when combined with suitable VLM reasoning modes.

What carries the argument

Broadcasting the aggregated advantage across the entire editing trajectory, which links local per-turn decisions to global multi-turn success.
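
A minimal sketch of the broadcasting idea, in NumPy. It assumes a GRPO-style group of G trajectories sampled for the same K-turn task, a plain mean as the turn-level aggregator, and z-score normalization within the group. The function names and these specific choices are illustrative assumptions, not the paper's implementation. The second function shows a per-turn alternative for contrast.

```python
import numpy as np


def broadcast_aggregated_advantage(turn_rewards, eps=1e-8):
    """Return a (G, K) advantage array in which every turn of a trajectory
    shares that trajectory's group-normalized, turn-aggregated advantage.

    turn_rewards: array-like of shape (G, K), G trajectories by K turns.
    """
    turn_rewards = np.asarray(turn_rewards, dtype=float)
    # Assumption: aggregate the K turn rewards with a plain mean.
    traj_reward = turn_rewards.mean(axis=1)                      # shape (G,)
    # GRPO-style normalization within the group of G trajectories.
    adv = (traj_reward - traj_reward.mean()) / (traj_reward.std() + eps)
    # Broadcast the single trajectory-level advantage to all K turns.
    return np.repeat(adv[:, None], turn_rewards.shape[1], axis=1)


def per_turn_advantage(turn_rewards, eps=1e-8):
    """Contrast case: normalize each turn's rewards separately within the group."""
    turn_rewards = np.asarray(turn_rewards, dtype=float)
    mean = turn_rewards.mean(axis=0, keepdims=True)
    std = turn_rewards.std(axis=0, keepdims=True)
    return (turn_rewards - mean) / (std + eps)
```

In the per-turn variant, a trajectory whose first edit scores well but whose later edits fail still receives positive credit at turn 1. With broadcasting, every turn of that trajectory shares the same trajectory-level signal.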

If this is right

  • Performance improves across diverse base models including FLUX.1-Kontext-dev.
  • On FLUX.1-Kontext-dev, turn-3 overall scores rise by 6.85 points, surpassing Qwen-Image-Edit.
  • Marginal success rates stay high while exposure bias falls.
  • The same multi-reward structure works for both GRPO and NFT reinforcement learning.
  • The approach supplies a foundation for reliable iterative human-AI image collaboration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The advantage-broadcasting pattern could transfer to other sequential generation domains such as video or 3D asset editing.
  • VLM-based reward variance might be further lowered by mixing multiple reasoning modes within a single training run.
  • Interactive editing interfaces could become more stable if users receive turn-level feedback derived from the aggregated advantage signal.

Load-bearing premise

The assumption that broadcasting aggregated advantage across the editing trajectory bridges local planning and global multi-turn success without introducing reward hacking or bias from VLM scoring.

What would settle it

A direct comparison in which models trained with per-turn advantages instead of broadcast aggregated advantages show no gain or a drop in turn-3 overall performance relative to the single-turn baseline.

Figures

Figures reproduced from arXiv: 2606.01985 by Jiahui Huang, Jianwen Xie, Mingyuan Zhou, Nanzhu Wang, Oscar Leong, Shu Wang, Tianyu Chen, Yasi Zhang, Ying Nian Wu.

Figure 1: Qualitative results of MT-EditFlow (Ours).
Figure 2: Illustration of the reward bias and variance trade-off between think…
Figure 3: Multi-turn editing performance increases with the number of training turns K used…
Figure 5: Ablation on Reward Signal Design. The top row shows training reward, and the bottom row shows held-out evaluation reward. Fine-grained 1–5 scoring provides a more effective learning signal than binary variants, thinking mode produces better performance with more accurate but higher-variance rewards, and advantage-level fusion consistently outperforms reward-level fusion.
Figure 6: Ablation on λCC. Higher λCC improves CC at the cost of IF; λCC = 0.3 achieves the best overall balance.
  5.1 Main Results. Our experiments reveal several important insights regarding MT-EditFlow: 1. Substantial Multi-turn Improvement: As shown in Tab. 1, MT-EditFlow significantly improves overall multi-turn performance (EdiVal-O) over the corresponding open-source backbones, narrowing the gap to strong close…
Figure 7: Ablation on group size G and discretization steps T. We found that a smaller group size G = 16 and a larger discretization step T = 12 lead to better performance.
Figure 8: Overview of MT-EditFlow. The figure shows the overall MT-EditFlow pipeline, including multi-turn trajectory sampling and the corresponding training procedure.
  B Training Settings, B.1 Reward Model Setup. We use Qwen3-VL-8B-Instruct as the automatic instruction-following reward model. During training, we query the multi-turn evaluator once per trajectory using the reference image, three turn edited images, …
Figure 9: Training data distribution. Left: distribution of 6,957 sequential edit steps across 9 editing categories. Right: object-category distribution of 2,319 reference images across 12 semantic categories.
  … as indicative runtime measurements rather than strict efficiency benchmarks, since the runs differ in backbone size and RL objective.
Figure 10: Within-group reward variance analysis. Raw IF and CC rewards can have substantially different within-group scales and variances. This mismatch is one of the main motivations for performing fusion at the advantage level rather than directly in raw reward space.
Figure 11: IF/CC reward statistics related to group-size selection.
Figure 12: Visual quality across turns. Turn-level visual quality remains strong across the editing trajectory and generally stays above the reference-image baseline, indicating that MT-EditFlow improves multi-turn performance without obvious visual-quality collapse.
Original abstract

Recent breakthroughs in instruction-based image editing have captured significant attention, as models are now capable of handling real-world editing demands with the practicality required by everyday users. However, editing models trained primarily for single-turn edits often break down in multi-turn editing--the natural interactive setting where a user iteratively refines an image based on the model's own previous outputs. This failure stems from the all-or-nothing requirement, where a single failed turn compromises the entire sequence, and error propagation, where exposure bias leads to compounding editing errors. To address these challenges, we introduce MT-EditFlow, a flow-matching reinforcement learning framework designed to optimize reward signals for sequential image editing. MT-EditFlow integrates a multi-turn perspective with a multi-reward formulation to provide a unified structure applicable to both GRPO and NFT-based reinforcement learning methods. We systematically analyze and optimize the reward signal by investigating effective scoring strategies for turn-level aggregation, VLM reasoning modes to trade off reward bias and variance, and advantage fusion levels to prevent reward hacking. Our findings reveal that broadcasting the aggregated advantage across the entire editing trajectory effectively bridges the gap between local planning and global multi-turn task success. Extensive experiments demonstrate that MT-EditFlow significantly improves performance across diverse base models. Notably, it boosts FLUX.1-Kontext-dev by 6.85 points in turn-3 overall performance, surpassing state-of-the-art open-source models such as Qwen-Image-Edit. By maintaining high marginal success rates and reducing exposure bias, MT-EditFlow provides a foundation for more reliable and natural human-AI collaboration in visual content creation.
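
The difference between reward-level and advantage-level fusion can be sketched as follows. The sketch assumes two reward streams per trajectory, instruction following (IF) and content consistency (CC), an additive combination weighted by λCC, and z-score normalization within a group. The additive form and all function names are hypothetical choices for illustration; only the default weight of 0.3 echoes the value reported in the Figure 6 caption.

```python
import numpy as np


def zscore(x, eps=1e-8):
    """Normalize a group of rewards to zero mean and unit variance."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.std() + eps)


def reward_level_fusion(r_if, r_cc, lam_cc=0.3):
    """Mix raw rewards first, then normalize once.
    The reward with the larger raw spread dominates the result."""
    mixed = np.asarray(r_if, dtype=float) + lam_cc * np.asarray(r_cc, dtype=float)
    return zscore(mixed)


def advantage_level_fusion(r_if, r_cc, lam_cc=0.3):
    """Normalize each reward within the group first, then mix.
    Each reward contributes on a comparable scale regardless of raw spread."""
    return zscore(r_if) + lam_cc * zscore(r_cc)
```

When IF lives on a 1–5 scale and CC varies only in the second decimal place, reward-level fusion is almost indistinguishable from using IF alone, whereas advantage-level fusion keeps the CC term at the weight set by λCC. This matches the motivation given in the Figure 10 caption.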

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces MT-EditFlow, a flow-matching reinforcement learning framework for multi-turn image editing that combines a multi-reward formulation with aggregated advantage broadcasting to address error propagation and exposure bias in sequential edits. It analyzes turn-level aggregation, VLM reasoning modes for bias-variance trade-offs, and advantage fusion levels, claiming that broadcasting aggregated advantages bridges local planning and global success. Extensive experiments are reported to show performance gains across base models, including a 6.85-point boost in turn-3 overall performance for FLUX.1-Kontext-dev that surpasses Qwen-Image-Edit.

Significance. If the reported gains prove robust under statistical validation and the multi-reward strategy demonstrably avoids propagating VLM biases, the work would offer a practical advance for interactive, multi-turn image editing systems that better support natural human-AI collaboration.

major comments (2)
  1. [Abstract and experimental results section] The central empirical claims (e.g., the 6.85-point gain on FLUX.1-Kontext-dev and surpassing of Qwen-Image-Edit) are presented without error bars, multiple-run statistics, or dataset specifications. These omissions are load-bearing because they prevent assessment of whether the improvements exceed noise or are reproducible across the claimed diverse base models.
  2. [Reward strategy analysis and multi-reward formulation] The analysis of reward strategies claims that broadcasting aggregated advantage mitigates exposure bias without reward hacking or VLM bias propagation, yet no concrete ablation or metric is shown that isolates VLM scoring on semantic fidelity versus superficial features (e.g., color harmony) or tests for turn-dependent drift. This directly affects the validity of the multi-turn success-rate improvements.
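
The multiple-run statistics requested in the first major comment could be reported along the following lines. This is a sketch under stated assumptions: each list holds one benchmark score per independent training run, the two sets of runs are treated as independent, and the helper names are hypothetical. No numbers from the paper are used.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation) over independent runs."""
    return mean(scores), stdev(scores)


def gain_with_spread(baseline_runs, method_runs):
    """Return the mean gain of the method over the baseline and the
    standard error of that gain, assuming the two sets of runs are independent."""
    b_mean, b_sd = summarize_runs(baseline_runs)
    m_mean, m_sd = summarize_runs(method_runs)
    se = (b_sd ** 2 / len(baseline_runs) + m_sd ** 2 / len(method_runs)) ** 0.5
    return m_mean - b_mean, se
```

A reported gain is easier to interpret when it is several standard errors away from zero; with only a handful of runs the standard error itself is a rough estimate.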

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our empirical reporting and reward analysis. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract and experimental results section] The central empirical claims (e.g., the 6.85-point gain on FLUX.1-Kontext-dev and surpassing of Qwen-Image-Edit) are presented without error bars, multiple-run statistics, or dataset specifications. These omissions are load-bearing because they prevent assessment of whether the improvements exceed noise or are reproducible across the claimed diverse base models.

    Authors: We agree that error bars, multiple-run statistics, and explicit dataset specifications are necessary for assessing statistical robustness and reproducibility. The full manuscript reports results across multiple base models with the stated gains, but these details were not fully highlighted in the abstract and results summary. In the revision we will add error bars from multiple independent runs, report standard deviations, and provide complete dataset specifications to allow direct evaluation of whether the 6.85-point improvement exceeds noise.
    revision: yes

  2. Referee: [Reward strategy analysis and multi-reward formulation] The analysis of reward strategies claims that broadcasting aggregated advantage mitigates exposure bias without reward hacking or VLM bias propagation, yet no concrete ablation or metric is shown that isolates VLM scoring on semantic fidelity versus superficial features (e.g., color harmony) or tests for turn-dependent drift. This directly affects the validity of the multi-turn success-rate improvements.

    Authors: The paper presents ablations on turn-level aggregation, VLM reasoning modes, and advantage fusion levels, showing that aggregated advantage broadcasting improves multi-turn success. However, we acknowledge the need for more targeted metrics that isolate semantic fidelity from superficial features and explicit checks for turn-dependent drift. We will add these specific ablations and metrics in the revised version to further substantiate the claims about bias mitigation and exposure bias reduction.
    revision: partial

Circularity Check

0 steps flagged

No circularity in derivation chain

Full rationale

The paper presents MT-EditFlow as an empirical RL framework for multi-turn image editing, with performance gains (e.g., +6.85 on FLUX.1-Kontext-dev) reported from experiments on reward aggregation, VLM scoring, and advantage broadcasting. No equations, derivations, or self-citations appear in the provided text that reduce any claimed prediction or result to its inputs by construction. The multi-reward formulation and trajectory-level broadcasting are design choices whose effectiveness is assessed via external benchmarks and ablation, not by definitional equivalence or fitted-input renaming. The derivation chain is therefore self-contained against the reported metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; all technical details are absent.


