MT-EditFlow: Reinforcement Learning for Multi-Turn Image Editing with Flow Matching

Jiahui Huang; Jianwen Xie; Mingyuan Zhou; Nanzhu Wang; Oscar Leong; Shu Wang; Tianyu Chen; Yasi Zhang; Ying Nian Wu

arxiv: 2606.01985 · v1 · pith:XKJKW3DCnew · submitted 2026-06-01 · 💻 cs.CV

MT-EditFlow: Reinforcement Learning for Multi-Turn Image Editing with Flow Matching

Jiahui Huang , Yasi Zhang , Tianyu Chen , Shu Wang , Jianwen Xie , Oscar Leong , Mingyuan Zhou , Nanzhu Wang

show 1 more author

Ying Nian Wu

This is my paper

Pith reviewed 2026-06-28 15:05 UTC · model grok-4.3

classification 💻 cs.CV

keywords multi-turn image editingflow matchingreinforcement learningreward aggregationexposure biasimage editingGRPONFT

0 comments

The pith

MT-EditFlow uses flow-matching reinforcement learning to optimize multi-turn image editing rewards by broadcasting aggregated advantages across trajectories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Single-turn image editing models fail in multi-turn use because one bad edit ruins the sequence and errors compound through exposure bias. The paper introduces MT-EditFlow, a flow-matching RL method that combines multi-turn planning with multi-reward signals usable in GRPO and NFT training. Systematic tests of reward aggregation, VLM scoring modes, and advantage fusion show that broadcasting the aggregated advantage over the full trajectory links local edits to overall success. This produces measurable gains, including a 6.85-point lift on FLUX.1-Kontext-dev at turn 3 while keeping high per-turn success rates.

Core claim

MT-EditFlow is a flow-matching reinforcement learning framework that optimizes reward signals for sequential image editing through a multi-turn perspective and multi-reward formulation applicable to both GRPO and NFT-based methods. The central discovery is that broadcasting the aggregated advantage across the entire editing trajectory bridges local planning and global multi-turn task success, reducing exposure bias without reward hacking when combined with suitable VLM reasoning modes.

What carries the argument

Broadcasting the aggregated advantage across the entire editing trajectory, which links local per-turn decisions to global multi-turn success.

If this is right

Performance improves across diverse base models including FLUX.1-Kontext-dev.
Turn-3 overall scores rise by 6.85 points and surpass Qwen-Image-Edit.
Marginal success rates stay high while exposure bias falls.
The same multi-reward structure works for both GRPO and NFT reinforcement learning.
The approach supplies a foundation for reliable iterative human-AI image collaboration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The advantage-broadcasting pattern could transfer to other sequential generation domains such as video or 3D asset editing.
VLM-based reward variance might be further lowered by mixing multiple reasoning modes within a single training run.
Interactive editing interfaces could become more stable if users receive turn-level feedback derived from the aggregated advantage signal.

Load-bearing premise

The assumption that broadcasting aggregated advantage across the editing trajectory bridges local planning and global multi-turn success without introducing reward hacking or bias from VLM scoring.

What would settle it

A direct comparison in which models trained with per-turn advantages instead of broadcast aggregated advantages show no gain or a drop in turn-3 overall performance relative to the single-turn baseline.

Figures

Figures reproduced from arXiv: 2606.01985 by Jiahui Huang, Jianwen Xie, Mingyuan Zhou, Nanzhu Wang, Oscar Leong, Shu Wang, Tianyu Chen, Yasi Zhang, Ying Nian Wu.

**Figure 2.** Figure 2: Illustration of the reward bias and variance trade-off between think [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗

**Figure 3.** Figure 3: Multi-turn editing performance increases with the number of training turns K used [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗

**Figure 5.** Figure 5: Ablation on Reward Signal Design. The top row shows training reward, and the bottom row shows held-out evaluation reward. Fine-grained 1–5 scoring provides a more effectivenss learning signal than binary variants, thinking mode produces better performance with more accurate but higher-variance rewards, and advantagelevel fusion consistently outperforms reward-level fusion [PITH_FULL_IMAGE:figures/full_f… view at source ↗

**Figure 6.** Figure 6: Ablation on λCC. Higher λCC improves CC at the cost of IF; λCC = 0.3 achieves the best overall balance. 5.1 Main Results Our experiments reveal several important insights regarding MT-EditFlow: 1. Substantial Multi-turn Improvement: As shown in Tab. 1, MT-EditFlow significantly improves overall multi-turn performance (EdiVal-O) over the corresponding open-source backbones, narrowing the gap to strong close… view at source ↗

**Figure 7.** Figure 7: Ablation on group size G and discretization steps T. We found that a smaller group size G = 16 and a larger discretization step T = 12 lead to better performance. Impact of Content Consistency Weight λCC [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗

**Figure 8.** Figure 8: Overview of MT-EditFlow. The figure shows the overall MT-EditFlow pipeline, including multi-turn trajectory sampling and the corresponding training procedure. B Training Settings B.1 Reward Model Setup We use Qwen3-VL-8B-Instruct as the automatic instruction-following reward model. During training, we query the multi-turn evaluator once per trajectory using the reference image, three turn edited images, … view at source ↗

**Figure 9.** Figure 9: Training data distribution. Left: distribution of 6,957 sequential edit steps across 9 editing categories. Right: object-category distribution of 2,319 reference images across 12 semantic categories. as indicative runtime measurements rather than strict efficiency benchmarks, since the runs differ in backbone size and RL objective [PITH_FULL_IMAGE:figures/full_fig_p023_9.png] view at source ↗

**Figure 10.** Figure 10: Within-group reward variance analysis. Raw IF and CC rewards can have substantially different within-group scales and variances. This mismatch is one of the main motivations for performing fusion at the advantage level rather than directly in raw reward space. E.3 Group Size G on FLUX.1-Kontext-dev [PITH_FULL_IMAGE:figures/full_fig_p028_10.png] view at source ↗

**Figure 11.** Figure 11: IF/CC reward statistics related to group-size selection. [PITH_FULL_IMAGE:figures/full_fig_p030_11.png] view at source ↗

**Figure 12.** Figure 12: Visual quality across turns. Turn-level visual quality remains strong across the editing trajectory and generally stays above the reference-image baseline, indicating that MT-EditFlow improves multi-turn performance without obvious visual-quality collapse [PITH_FULL_IMAGE:figures/full_fig_p030_12.png] view at source ↗

read the original abstract

Recent breakthroughs in instruction-based image editing have captured significant attention, as models are now capable of handling real-world editing demands with the practicality required by everyday users. However, editing models trained primarily for single-turn edits often break down in multi-turn editing--the natural interactive setting where a user iteratively refines an image based on the model's own previous outputs. This failure stems from the all-or-nothing requirement, where a single failed turn compromises the entire sequence, and error propagation, where exposure bias leads to compounding editing errors. To address these challenges, we introduce MT-EditFlow, a flow-matching reinforcement learning framework designed to optimize reward signals for sequential image editing. MT-EditFlow integrates a multi-turn perspective with a multi-reward formulation to provide a unified structure applicable to both GRPO and NFT-based reinforcement learning methods. We systematically analyze and optimize the reward signal by investigating effective scoring strategies for turn-level aggregation, VLM reasoning modes to trade off reward bias and variance, and advantage fusion levels to prevent reward hacking. Our findings reveal that broadcasting the aggregated advantage across the entire editing trajectory effectively bridges the gap between local planning and global multi-turn task success. Extensive experiments demonstrate that MT-EditFlow significantly improves performance across diverse base models. Notably, it boosts FLUX.1-Kontext-dev by 6.85 points in turn-3 overall performance, surpassing state-of-the-art open-source models such as Qwen-Image-Edit. By maintaining high marginal success rates and reducing exposure bias, MT-EditFlow provides a foundation for more reliable and natural human-AI collaboration in visual content creation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MT-EditFlow combines flow matching with multi-turn RL and aggregated VLM rewards to cut exposure bias in iterative image editing, with reported gains on FLUX, but the bias-mitigation claims need direct verification.

read the letter

The core contribution is a flow-matching RL setup that treats multi-turn editing as a sequential process and uses a multi-reward formulation plus advantage broadcasting to keep later turns from collapsing when early ones are imperfect. It applies this to both GRPO and NFT, tests turn-level aggregation, VLM reasoning modes for bias-variance trade-offs, and different fusion levels, then shows a 6.85-point lift on FLUX.1-Kontext-dev at turn 3 while claiming to beat Qwen-Image-Edit.

The analysis of reward strategies is the part that stands out: they actually break down how broadcasting the aggregate advantage is supposed to align local edits with global success and check for hacking. That is concrete and directly addresses the exposure-bias problem the abstract flags.

The soft spot is still the VLM reward itself. If the scorer systematically prefers color harmony or surface coherence over semantic fidelity, or drifts across turns, broadcasting the advantage will just propagate the bias rather than correct it. The abstract says they analyzed modes and fusion to avoid this, but without seeing the actual VLM prompts, inter-annotator agreement, or any human validation of the scores, the 6.85-point number is hard to trust as more than fitting to the judge. No error bars or dataset details appear in the provided text either.

This is for labs already running instruction-based editing pipelines who need multi-turn stability. It is worth a serious referee because the multi-turn framing and reward-ablation structure are clear enough to evaluate, even if the VLM component requires extra scrutiny. I would send it to review but flag the reward validation as the key thing to check.

Referee Report

2 major / 0 minor

Summary. The paper introduces MT-EditFlow, a flow-matching reinforcement learning framework for multi-turn image editing that combines a multi-reward formulation with aggregated advantage broadcasting to address error propagation and exposure bias in sequential edits. It analyzes turn-level aggregation, VLM reasoning modes for bias-variance trade-offs, and advantage fusion levels, claiming that broadcasting aggregated advantages bridges local planning and global success. Extensive experiments are reported to show performance gains across base models, including a 6.85-point boost in turn-3 overall performance for FLUX.1-Kontext-dev that surpasses Qwen-Image-Edit.

Significance. If the reported gains prove robust under statistical validation and the multi-reward strategy demonstrably avoids propagating VLM biases, the work would offer a practical advance for interactive, multi-turn image editing systems that better support natural human-AI collaboration.

major comments (2)

[Abstract and experimental results section] The central empirical claims (e.g., the 6.85-point gain on FLUX.1-Kontext-dev and surpassing of Qwen-Image-Edit) are presented without error bars, multiple-run statistics, or dataset specifications. These omissions are load-bearing because they prevent assessment of whether the improvements exceed noise or are reproducible across the claimed diverse base models.
[Reward strategy analysis and multi-reward formulation] The analysis of reward strategies claims that broadcasting aggregated advantage mitigates exposure bias without reward hacking or VLM bias propagation, yet no concrete ablation or metric is shown that isolates VLM scoring on semantic fidelity versus superficial features (e.g., color harmony) or tests for turn-dependent drift. This directly affects the validity of the multi-turn success-rate improvements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our empirical reporting and reward analysis. We address each major comment below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract and experimental results section] The central empirical claims (e.g., the 6.85-point gain on FLUX.1-Kontext-dev and surpassing of Qwen-Image-Edit) are presented without error bars, multiple-run statistics, or dataset specifications. These omissions are load-bearing because they prevent assessment of whether the improvements exceed noise or are reproducible across the claimed diverse base models.

Authors: We agree that error bars, multiple-run statistics, and explicit dataset specifications are necessary for assessing statistical robustness and reproducibility. The full manuscript reports results across multiple base models with the stated gains, but these details were not fully highlighted in the abstract and results summary. In the revision we will add error bars from multiple independent runs, report standard deviations, and provide complete dataset specifications to allow direct evaluation of whether the 6.85-point improvement exceeds noise. revision: yes
Referee: [Reward strategy analysis and multi-reward formulation] The analysis of reward strategies claims that broadcasting aggregated advantage mitigates exposure bias without reward hacking or VLM bias propagation, yet no concrete ablation or metric is shown that isolates VLM scoring on semantic fidelity versus superficial features (e.g., color harmony) or tests for turn-dependent drift. This directly affects the validity of the multi-turn success-rate improvements.

Authors: The paper presents ablations on turn-level aggregation, VLM reasoning modes, and advantage fusion levels, showing that aggregated advantage broadcasting improves multi-turn success. However, we acknowledge the need for more targeted metrics that isolate semantic fidelity from superficial features and explicit checks for turn-dependent drift. We will add these specific ablations and metrics in the revised version to further substantiate the claims about bias mitigation and exposure bias reduction. revision: partial

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper presents MT-EditFlow as an empirical RL framework for multi-turn image editing, with performance gains (e.g., +6.85 on FLUX.1-Kontext-dev) reported from experiments on reward aggregation, VLM scoring, and advantage broadcasting. No equations, derivations, or self-citations appear in the provided text that reduce any claimed prediction or result to its inputs by construction. The multi-reward formulation and trajectory-level broadcasting are design choices whose effectiveness is assessed via external benchmarks and ablation, not by definitional equivalence or fitted-input renaming. The derivation chain is therefore self-contained against the reported metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; all technical details are absent.

pith-pipeline@v0.9.1-grok · 5852 in / 1054 out tokens · 25241 ms · 2026-06-28T15:05:33.103864+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

70 extracted references · 31 canonical work pages · 23 internal anchors

[1]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023) 3, 4, 5

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Hallucination of Multimodal Large Language Models: A Survey

Bai, Z., Wang, P., Xiao, T., He, T., Han, Z., Zhang, Z., Shou, M.Z.: Hallucination of multimodal large language models: A survey. arXiv preprint arXiv:2404.18930 (2024) 4

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

Training Diffusion Models with Reinforcement Learning

Black, K., Janner, M., Du, Y., Kostrikov, I., Levine, S.: Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301 (2023) 4

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Brooks, T., Holynski, A., Efros, A.A.: Instructpix2pix: Learning to follow image editing instructions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18392–18402 (2023) 4

2023
[5]

Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Nee- lakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Win- ter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford...

work page internal anchor Pith review Pith/arXiv arXiv 2020
[6]

In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G

Chang, Y., Zhang, Y., Fang, Z., Wu, Y.N., Bisk, Y., Gao, F.: Skews in the phe- nomenon space hinder generalization in text-to-image generation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. pp. 422–439. Springer Nature Switzerland, Cham (2025) 4

2024
[7]

In: ProceedingsoftheIEEE/CVFConferenceonComputerVisionandPatternRecog- nition

Chen, B., Xu, Z., Kirmani, S., Ichter, B., Sadigh, D., Guibas, L., Xia, F.: Spa- tialvlm: Endowing vision-language models with spatial reasoning capabilities. In: ProceedingsoftheIEEE/CVFConferenceonComputerVisionandPatternRecog- nition. pp. 14455–14465 (2024) 4

2024
[8]

In: The Fourteenth International Conference on Learning Representations (2026),https://openreview.net/forum?id=YkV0fnXgJA2, 3, 4, 7, 8, 10, 11

Chen, T., Zhang, Y., Zhang, Z., Yu, P., Wang, S., Wang, Z., Lin, K., Wang, X., Yang, Z., Li, L., Lin, C.C., Xie, J., Leong, O., Wang, L., Wu, Y.N., Zhou, M.: Edival-agent: An object-centric framework for automated, fine-grained evaluation of multi-turn editing. In: The Fourteenth International Conference on Learning Representations (2026),https://openrevi...

2026
[9]

Advances in Neural Information Processing Systems37, 135062–135093 (2024) 4

Cheng, A.C., Yin, H., Fu, Y., Guo, Q., Yang, R., Kautz, J., Wang, X., Liu, S.: Spatialrgpt: Grounded spatial reasoning in vision-language models. Advances in Neural Information Processing Systems37, 135062–135093 (2024) 4

2024
[10]

Deepmind, G.: Gemini 2.5: Pushing the frontier with advanced reasoning, mul- timodality, long context, and next generation agentic capabilities (2025),https: //arxiv.org/abs/2507.062612, 4

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Advances in Neural Information Processing Sys- tems36, 79858–79885 (2023) 4

Fan, Y., Watkins, O., Du, Y., Liu, H., Ryu, M., Boutilier, C., Abbeel, P., Ghavamzadeh, M., Lee, K., Lee, K.: Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models. Advances in Neural Information Processing Sys- tems36, 79858–79885 (2023) 4

2023
[12]

Seedream 3.0 Technical Report

Gao, Y., Gong, L., Guo, Q., Hou, X., Lai, Z., Li, F., Li, L., Lian, X., Liao, C., Liu, L., et al.: Seedream 3.0 technical report. arXiv preprint arXiv:2504.11346 (2025) 2, 4

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Huang et al

Gemini2, G.: Experiment with gemini 2.0 flash native image generation.https:// developers.googleblog.com/en/experiment-with-gemini-20-flash-native- image-generation/(2025), accessed: 2025-06-22 4 16 J. Huang et al

2025
[14]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al.: Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025) 4

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

In: ICLR (2023) 4

Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross-attention control. In: ICLR (2023) 4

2023
[16]

Advances in neural information processing systems33, 6840–6851 (2020) 4, 5

Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems33, 6840–6851 (2020) 4, 5

2020
[17]

Classifier-Free Diffusion Guidance

Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022) 5

work page internal anchor Pith review Pith/arXiv arXiv 2022
[18]

Iclr1(2), 3 (2022) 11

Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. Iclr1(2), 3 (2022) 11

2022
[19]

arXiv preprint arXiv:2404.09990 (2024) 4

Hui, M., Yang, S., Zhao, B., Shi, Y., Wang, H., Wang, P., Zhou, Y., Xie, C.: Hq- edit: A high-quality dataset for instruction-based image editing. arXiv preprint arXiv:2404.09990 (2024) 4

work page arXiv 2024
[20]

Ku, M., Jiang, D., Wei, C., Yue, X., Chen, W.: Viescore: Towards explainable metricsforconditionalimagesynthesisevaluation.arXivpreprintarXiv:2312.14867 (2023) 4

work page arXiv 2023
[21]

International journal of computer vision128(7), 1956–1981 (2020) 10

Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Ka- mali, S., Popov, S., Malloci, M., Kolesnikov, A., et al.: The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International journal of computer vision128(7), 1956–1981 (2020) 10

1956
[22]

Labs, B.F.: FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/flux-2 (2025) 2, 3, 11

2025
[23]

Labs,B.F.,Batifol,S.,Blattmann,A.,Boesel,F.,Consul,S.,Diagne,C.,Dockhorn, T., English, J., English, Z., Esser, P., Kulal, S., Lacey, K., Levi, Y., Li, C., Lorenz, D., Müller, J., Podell, D., Rombach, R., Saini, H., Sauer, A., Smith, L.: Flux.1 kontext: Flow matching for in-context image generation and editing in latent space (2025),https://arxiv.org/abs/2...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

Li, J., Cui, Y., Huang, T., Ma, Y., Fan, C., Yang, M., Zhong, Z.: Mixgrpo: Unlock- ing flow-based grpo efficiency with mixed ode-sde (2025),https://arxiv.org/ abs/2507.218025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback

Li, Z., Liu, Z., Zhang, Q., Lin, B., Yuan, S., Yan, Z., Ye, Y., Yu, W., Niu, Y., Yuan, L.: Uniworld-v2: Reinforce image editing with diffusion negative-aware finetuning and mllm implicit feedback. arXiv preprint arXiv:2510.16888 (2025) 3, 5, 7

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024) 4

Liang, Y., He, J., Li, G., Li, P., Klimovskiy, A., Carolan, N., Sun, J., Pont-Tuset, J., Young, S., Yang, F., Ke, J., Dvijotham, K.D., Collins, K., Luo, Y., Li, Y., Kohlhoff, K.J., Ramachandran, D., Navalpakkam, V.: Rich human feedback for text-to-image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024) 4

2024
[27]

Flow Matching for Generative Modeling

Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022) 2, 4, 5

work page internal anchor Pith review Pith/arXiv arXiv 2022
[28]

In: The Thirty- ninthAnnualConferenceonNeuralInformationProcessingSystems(2025),https: //openreview.net/forum?id=oCBKGw5HNf2, 4, 6, 11

Liu, J., Liu, G., Liang, J., Li, Y., Liu, J., Wang, X., Wan, P., ZHANG, D., Ouyang, W.: Flow-GRPO: Training flow matching models via online RL. In: The Thirty- ninthAnnualConferenceonNeuralInformationProcessingSystems(2025),https: //openreview.net/forum?id=oCBKGw5HNf2, 4, 6, 11

2025
[29]

Step1X-Edit: A Practical Framework for General Image Editing

Liu, S., Han, Y., Xing, P., Yin, F., Wang, R., Cheng, W., Liao, J., Wang, Y., Fu, H., Han, C., et al.: Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761 (2025) 2, 3, 4, 7

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022) 2, 4, 5 Reinforcement Learning for Multi-Turn Image Editing with Flow Matching 17

work page internal anchor Pith review Pith/arXiv arXiv 2022
[31]

In: The Fourteenth International Conference on Learning Representations (2026), https://openreview.net/forum?id=E7YpL4L4Xh3, 5, 7

Luo, X., Wang, J., Wu, C., Xiao, S., Jiang, X., Lian, D., Zhang, J., Liu, D., Liu, Z.: Editscore: Unlocking online RL for image editing via high-fidelity reward modeling. In: The Fourteenth International Conference on Learning Representations (2026), https://openreview.net/forum?id=E7YpL4L4Xh3, 5, 7

2026
[32]

Ad- vances in Neural Information Processing Systems37, 41494–41516 (2024) 4

Ma, Y., Ji, J., Ye, K., Lin, W., Wang, Z., Zheng, Y., Zhou, Q., Sun, X., Ji, R.: I2ebench: A comprehensive benchmark for instruction-based image editing. Ad- vances in Neural Information Processing Systems37, 41494–41516 (2024) 4

2024
[33]

Ma, Y., Wu, X., Sun, K., Li, H.: Hpsv3: Towards wide-spectrum human preference score (2025),https://arxiv.org/abs/2508.037894

work page arXiv 2025
[34]

com / index / introducing-4o-image-generation/(2025), accessed: 2025-06-22 2, 4

OpenAI: Introducing 4o image generation.https : / / openai . com / index / introducing-4o-image-generation/(2025), accessed: 2025-06-22 2, 4

2025
[35]

Lvlm- count: Enhancing the counting ability of large vision-language models, 2026

Qharabagh, M.F., Ghofrani, M., Fountoulakis, K.: Lvlm-count: Enhancing the counting ability of large vision-language models. arXiv preprint arXiv:2412.00686 (2024) 4

work page arXiv 2024
[36]

Qian, Y., Bocek-Rivele, E., Song, L., Tong, J., Yang, Y., Lu, J., Hu, W., Gan, Z.: Pico-banana-400k: A large-scale dataset for text-guided image editing (2025), https://arxiv.org/abs/2510.1980810

work page arXiv 2025
[37]

arXiv preprint arXiv:2505.11493 , year=

Qian, Y., Lu, J., Fu, T.J., Wang, X., Chen, C., Yang, Y., Hu, W., Gan, Z.: Gie- bench: Towards grounded evaluation for text-guided image editing. arXiv preprint arXiv:2505.11493 (2025) 8

work page arXiv 2025
[38]

In: SC20: international conference for high performance computing, networking, storage and analysis

Rajbhandari, S., Rasley, J., Ruwase, O., He, Y.: Zero: Memory optimizations to- ward training trillion parameter models. In: SC20: international conference for high performance computing, networking, storage and analysis. pp. 1–16. IEEE (2020) 11

2020
[39]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022) 2, 4

2022
[40]

Generalization in generation: A closer look at exposure bias.arXiv preprint arXiv:1910.00292, 2019

Schmidt, F.: Generalization in generation: A closer look at exposure bias. arXiv preprint arXiv:1910.00292 (2019) 2

work page arXiv 1910
[41]

Proximal Policy Optimization Algorithms

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) 4

work page internal anchor Pith review Pith/arXiv arXiv 2017
[42]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Sheynin, S., Polyak, A., Singer, U., Kirstain, Y., Zohar, A., Ashual, O., Parikh, D., Taigman, Y.: Emu edit: Precise image editing via recognition and generation tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8871–8879 (2024) 7

2024
[43]

DINOv3

Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khali- dov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al.: Dinov3. arXiv preprint arXiv:2508.10104 (2025) 5

work page internal anchor Pith review Pith/arXiv arXiv 2025
[44]

Advances in neural information processing systems34, 1415–1428 (2021) 4

Song, Y., Durkan, C., Murray, I., Ermon, S.: Maximum likelihood training of score- based diffusion models. Advances in neural information processing systems34, 1415–1428 (2021) 4

2021
[45]

Score-Based Generative Modeling through Stochastic Differential Equations

Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score- based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020) 4, 6

work page internal anchor Pith review Pith/arXiv arXiv 2011
[46]

Journal of Cognitive Neuroscience11(1), 126–134 (1999) 2, 4, 5

Sutton, R.S., Barto, A.G., et al.: Reinforcement learning. Journal of Cognitive Neuroscience11(1), 126–134 (1999) 2, 4, 5

1999
[47]

Vision Language Models are Biased

Vo, A., Nguyen, K.N., Taesiri, M.R., Dang, V.T., Nguyen, A.T., Kim, D.: Vision language models are biased. arXiv preprint arXiv:2505.23941 (2025) 4 18 J. Huang et al

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., ming Yin, S., Bai, S., Xu, X., Chen, Y., Chen, Y., Tang, Z., Zhang, Z., Wang, Z., Yang, A., Yu, B., Cheng, C., Liu, D., Li, D., Zhang, H., Meng, H., Wei, H., Ni, J., Chen, K., Cao, K., Peng, L., Qu, L., Wu, M., Wang, P., Yu, S., Wen, T., Feng, W., Xu, X., Wang, Y., Zhang, Y., Zhu, Y., Wu, Y., Cai, Y., L...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[49]

OmniGen2: Towards Instruction-Aligned Multimodal Generation

Wu, C., Zheng, P., Yan, R., Xiao, S., Luo, X., Wang, Y., Li, W., Jiang, X., Liu, Y., Zhou, J., Liu, Z., Xia, Z., Li, C., Deng, H., Wang, J., Luo, K., Zhang, B., Lian, D., Wang, X., Wang, Z., Huang, T., Liu, Z.: Omnigen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871 (2025) 2, 4

work page internal anchor Pith review Pith/arXiv arXiv 2025
[50]

In: The Fourteenth In- ternational Conference on Learning Representations (2026),https://openreview

Wu, K., Jiang, S., Ku, M., Nie, P., Liu, M., Chen, W.: Editreward: A human- aligned reward model for instruction-guided image editing. In: The Fourteenth In- ternational Conference on Learning Representations (2026),https://openreview. net/forum?id=eZu358JOOR3, 7

2026
[51]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Xiao, S., Wang, Y., Zhou, J., Yuan, H., Xing, X., Yan, R., Li, C., Wang, S., Huang, T., Liu, Z.: Omnigen: Unified image generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 13294–13304 (2025) 2, 4

2025
[52]

In: Proceedingsofthe37thInternationalConferenceonNeuralInformationProcessing Systems

Xu, J., Liu, X., Wu, Y., Tong, Y., Li, Q., Ding, M., Tang, J., Dong, Y.: Imagere- ward: learning and evaluating human preferences for text-to-image generation. In: Proceedingsofthe37thInternationalConferenceonNeuralInformationProcessing Systems. pp. 15903–15935 (2023) 4

2023
[53]

DanceGRPO: Unleashing GRPO on Visual Generation

Xue, Z., Wu, J., Gao, Y., Kong, F., Zhu, L., Chen, M., Liu, Z., Liu, W., Guo, Q., Huang, W., et al.: Dancegrpo: Unleashing grpo on visual generation. arXiv preprint arXiv:2505.07818 (2025) 2, 4

work page internal anchor Pith review Pith/arXiv arXiv 2025
[54]

Qwen3 Technical Report

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025) 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[55]

arXiv preprint arXiv:2504.13143 , year=

Yang, S., Hui, M., Zhao, B., Zhou, Y., Ruiz, N., Xie, C.: Complex-edit: Cot-like instruction generation for complexity-controllable image editing benchmark. arXiv preprint arXiv:2504.13143 (2025) 2, 4

work page arXiv 2025
[56]

In: The Thirty-ninth Annual Confer- ence on Neural Information Processing Systems Datasets and Benchmarks Track (2025),https://openreview.net/forum?id=uUCSrMlfD32, 3, 4, 7, 11

Ye, Y., He, X., Li, Z., Lin, B., Yuan, S., Yan, Z., Hou, B., Yuan, L.: Imgedit: A unified image editing dataset and benchmark. In: The Thirty-ninth Annual Confer- ence on Neural Information Processing Systems Datasets and Benchmarks Track (2025),https://openreview.net/forum?id=uUCSrMlfD32, 3, 4, 7, 11

2025
[57]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Yu, Q., Chow, W., Yue, Z., Pan, K., Wu, Y., Wan, X., Li, J., Tang, S., Zhang, H., Zhuang, Y.: Anyedit: Mastering unified high-quality image editing for any idea. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 26125–26135 (2025) 4, 7

2025
[58]

Advances in Neural Information Pro- cessing Systems36, 31428–31449 (2023) 4, 7

Zhang, K., Mo, L., Chen, W., Sun, H., Su, Y.: Magicbrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Pro- cessing Systems36, 31428–31449 (2023) 4, 7

2023
[59]

In: Leonardis, A., Ricci, E., Roth, S., Rus- sakovsky, O., Sattler, T., Varol, G

Zhang,Y.,Yu,P.,Wu,Y.N.:Object-conditionedenergy-basedattentionmapalign- ment in text-to-image diffusion models. In: Leonardis, A., Ricci, E., Roth, S., Rus- sakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. pp. 55–71. Springer Nature Switzerland, Cham (2025) 4

2024
[60]

Zhang, Y., Yu, P., Zhu, Y., Chang, Y., Gao, F., Wu, Y.N., Leong, O.: Flow pri- ors for linear inverse problems via iterative corrupted trajectory matching. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024),https://openreview.net/forum?id=1H2e7USI094 Reinforcement Learning for Multi-Turn Image Editing with Flow Matching 19

2024
[61]

In: The Thirteenth International Conference on Learning Rep- resentations (2025),https://openreview.net/forum?id=84pDoCD4lH4

Zhang, Z., Hu, F., Lee, J., Shi, F., Kordjamshidi, P., Chai, J., Ma, Z.: Do vision- language models represent space and how? evaluating spatial frame of reference under ambiguities. In: The Thirteenth International Conference on Learning Rep- resentations (2025),https://openreview.net/forum?id=84pDoCD4lH4

2025
[62]

Advances in Neural Information Processing Systems37, 3058–3093 (2024) 4, 7

Zhao, H., Ma, X.S., Chen, L., Si, S., Wu, R., An, K., Yu, P., Zhang, M., Li, Q., Chang, B.: Ultraedit: Instruction-based fine-grained image editing at scale. Advances in Neural Information Processing Systems37, 3058–3093 (2024) 4, 7

2024
[63]

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

Zhao,Y.,Gu,A.,Varma,R.,Luo,L.,Huang,C.C.,Xu,M.,Wright,L.,Shojanazeri, H., Ott, M., Shleifer, S., et al.: Pytorch fsdp: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277 (2023) 11

work page internal anchor Pith review Pith/arXiv arXiv 2023
[64]

In: The Fourteenth International Conference on Learning Representations (2026),https://openreview.net/forum?id=VJZ477R89F2, 5, 6, 11 20 J

Zheng, K., Chen, H., Ye, H., Wang, H., Zhang, Q., Jiang, K., Su, H., Ermon, S., Zhu, J., Liu, M.Y.: DiffusionNFT: Online diffusion reinforcement with forward process. In: The Fourteenth International Conference on Learning Representations (2026),https://openreview.net/forum?id=VJZ477R89F2, 5, 6, 11 20 J. Huang et al. A Overview Fig.8: Overview of MT-EditF...

2026
[69]

Then give final score 1-5 in <Score> <Thought> [Analysis here] </Thought> <Score>X</Score> Multi-turn IF prompt

Fully Compliant: Instruction accurately and completely executed, all non-targeted original content perfectly preserved Provide analysis in <Thought> tag covering: instruction execution accuracy and completeness, preservation of non-targeted elements, unintended changes. Then give final score 1-5 in <Score> <Thought> [Analysis here] </Thought> <Score>X</Sc...
[70]

Failed: Instruction completely ignored/opposite changes made, or critical original content destroyed
[71]

Minimal: Only minor parts of instruction followed, major elements missing/wrong, or severe content loss/unintended changes
[72]

Partial: Key instruction elements followed but incomplete/ inaccurate, with noticeable original content loss/unintended modifications
[73]

Mostly Compliant: Instruction largely executed correctly with minor flaws, original content well-preserved with minimal unintended changes
[74]

Fully Compliant: Instruction accurately and completely executed, all non-targeted original content perfectly preserved Provide analysis in <Thought> tag covering each turn’s execution. Then give scores in format: <Thought> [Turn 1 analysis] [Turn 2 analysis] [Turn 3 analysis] </Thought> <Turn1Score>X</Turn1Score> <Turn2Score>X</Turn2Score> <Turn3Score>X</...

[1] [1]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023) 3, 4, 5

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Hallucination of Multimodal Large Language Models: A Survey

Bai, Z., Wang, P., Xiao, T., He, T., Han, Z., Zhang, Z., Shou, M.Z.: Hallucination of multimodal large language models: A survey. arXiv preprint arXiv:2404.18930 (2024) 4

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

Training Diffusion Models with Reinforcement Learning

Black, K., Janner, M., Du, Y., Kostrikov, I., Levine, S.: Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301 (2023) 4

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Brooks, T., Holynski, A., Efros, A.A.: Instructpix2pix: Learning to follow image editing instructions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18392–18402 (2023) 4

2023

[5] [5]

Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Nee- lakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Win- ter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford...

work page internal anchor Pith review Pith/arXiv arXiv 2020

[6] [6]

In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G

Chang, Y., Zhang, Y., Fang, Z., Wu, Y.N., Bisk, Y., Gao, F.: Skews in the phe- nomenon space hinder generalization in text-to-image generation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. pp. 422–439. Springer Nature Switzerland, Cham (2025) 4

2024

[7] [7]

In: ProceedingsoftheIEEE/CVFConferenceonComputerVisionandPatternRecog- nition

Chen, B., Xu, Z., Kirmani, S., Ichter, B., Sadigh, D., Guibas, L., Xia, F.: Spa- tialvlm: Endowing vision-language models with spatial reasoning capabilities. In: ProceedingsoftheIEEE/CVFConferenceonComputerVisionandPatternRecog- nition. pp. 14455–14465 (2024) 4

2024

[8] [8]

In: The Fourteenth International Conference on Learning Representations (2026),https://openreview.net/forum?id=YkV0fnXgJA2, 3, 4, 7, 8, 10, 11

Chen, T., Zhang, Y., Zhang, Z., Yu, P., Wang, S., Wang, Z., Lin, K., Wang, X., Yang, Z., Li, L., Lin, C.C., Xie, J., Leong, O., Wang, L., Wu, Y.N., Zhou, M.: Edival-agent: An object-centric framework for automated, fine-grained evaluation of multi-turn editing. In: The Fourteenth International Conference on Learning Representations (2026),https://openrevi...

2026

[9] [9]

Advances in Neural Information Processing Systems37, 135062–135093 (2024) 4

Cheng, A.C., Yin, H., Fu, Y., Guo, Q., Yang, R., Kautz, J., Wang, X., Liu, S.: Spatialrgpt: Grounded spatial reasoning in vision-language models. Advances in Neural Information Processing Systems37, 135062–135093 (2024) 4

2024

[10] [10]

Deepmind, G.: Gemini 2.5: Pushing the frontier with advanced reasoning, mul- timodality, long context, and next generation agentic capabilities (2025),https: //arxiv.org/abs/2507.062612, 4

work page internal anchor Pith review Pith/arXiv arXiv 2025

[11] [11]

Advances in Neural Information Processing Sys- tems36, 79858–79885 (2023) 4

Fan, Y., Watkins, O., Du, Y., Liu, H., Ryu, M., Boutilier, C., Abbeel, P., Ghavamzadeh, M., Lee, K., Lee, K.: Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models. Advances in Neural Information Processing Sys- tems36, 79858–79885 (2023) 4

2023

[12] [12]

Seedream 3.0 Technical Report

Gao, Y., Gong, L., Guo, Q., Hou, X., Lai, Z., Li, F., Li, L., Lian, X., Liao, C., Liu, L., et al.: Seedream 3.0 technical report. arXiv preprint arXiv:2504.11346 (2025) 2, 4

work page internal anchor Pith review Pith/arXiv arXiv 2025

[13] [13]

Huang et al

Gemini2, G.: Experiment with gemini 2.0 flash native image generation.https:// developers.googleblog.com/en/experiment-with-gemini-20-flash-native- image-generation/(2025), accessed: 2025-06-22 4 16 J. Huang et al

2025

[14] [14]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al.: Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025) 4

work page internal anchor Pith review Pith/arXiv arXiv 2025

[15] [15]

In: ICLR (2023) 4

Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross-attention control. In: ICLR (2023) 4

2023

[16] [16]

Advances in neural information processing systems33, 6840–6851 (2020) 4, 5

Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems33, 6840–6851 (2020) 4, 5

2020

[17] [17]

Classifier-Free Diffusion Guidance

Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022) 5

work page internal anchor Pith review Pith/arXiv arXiv 2022

[18] [18]

Iclr1(2), 3 (2022) 11

Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. Iclr1(2), 3 (2022) 11

2022

[19] [19]

arXiv preprint arXiv:2404.09990 (2024) 4

Hui, M., Yang, S., Zhao, B., Shi, Y., Wang, H., Wang, P., Zhou, Y., Xie, C.: Hq- edit: A high-quality dataset for instruction-based image editing. arXiv preprint arXiv:2404.09990 (2024) 4

work page arXiv 2024

[20] [20]

Ku, M., Jiang, D., Wei, C., Yue, X., Chen, W.: Viescore: Towards explainable metricsforconditionalimagesynthesisevaluation.arXivpreprintarXiv:2312.14867 (2023) 4

work page arXiv 2023

[21] [21]

International journal of computer vision128(7), 1956–1981 (2020) 10

Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Ka- mali, S., Popov, S., Malloci, M., Kolesnikov, A., et al.: The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International journal of computer vision128(7), 1956–1981 (2020) 10

1956

[22] [22]

Labs, B.F.: FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/flux-2 (2025) 2, 3, 11

2025

[23] [23]

Labs,B.F.,Batifol,S.,Blattmann,A.,Boesel,F.,Consul,S.,Diagne,C.,Dockhorn, T., English, J., English, Z., Esser, P., Kulal, S., Lacey, K., Levi, Y., Li, C., Lorenz, D., Müller, J., Podell, D., Rombach, R., Saini, H., Sauer, A., Smith, L.: Flux.1 kontext: Flow matching for in-context image generation and editing in latent space (2025),https://arxiv.org/abs/2...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[24] [24]

Li, J., Cui, Y., Huang, T., Ma, Y., Fan, C., Yang, M., Zhong, Z.: Mixgrpo: Unlock- ing flow-based grpo efficiency with mixed ode-sde (2025),https://arxiv.org/ abs/2507.218025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[25] [25]

Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback

Li, Z., Liu, Z., Zhang, Q., Lin, B., Yuan, S., Yan, Z., Ye, Y., Yu, W., Niu, Y., Yuan, L.: Uniworld-v2: Reinforce image editing with diffusion negative-aware finetuning and mllm implicit feedback. arXiv preprint arXiv:2510.16888 (2025) 3, 5, 7

work page internal anchor Pith review Pith/arXiv arXiv 2025

[26] [26]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024) 4

Liang, Y., He, J., Li, G., Li, P., Klimovskiy, A., Carolan, N., Sun, J., Pont-Tuset, J., Young, S., Yang, F., Ke, J., Dvijotham, K.D., Collins, K., Luo, Y., Li, Y., Kohlhoff, K.J., Ramachandran, D., Navalpakkam, V.: Rich human feedback for text-to-image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024) 4

2024

[27] [27]

Flow Matching for Generative Modeling

Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022) 2, 4, 5

work page internal anchor Pith review Pith/arXiv arXiv 2022

[28] [28]

In: The Thirty- ninthAnnualConferenceonNeuralInformationProcessingSystems(2025),https: //openreview.net/forum?id=oCBKGw5HNf2, 4, 6, 11

Liu, J., Liu, G., Liang, J., Li, Y., Liu, J., Wang, X., Wan, P., ZHANG, D., Ouyang, W.: Flow-GRPO: Training flow matching models via online RL. In: The Thirty- ninthAnnualConferenceonNeuralInformationProcessingSystems(2025),https: //openreview.net/forum?id=oCBKGw5HNf2, 4, 6, 11

2025

[29] [29]

Step1X-Edit: A Practical Framework for General Image Editing

Liu, S., Han, Y., Xing, P., Yin, F., Wang, R., Cheng, W., Liao, J., Wang, Y., Fu, H., Han, C., et al.: Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761 (2025) 2, 3, 4, 7

work page internal anchor Pith review Pith/arXiv arXiv 2025

[30] [30]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022) 2, 4, 5 Reinforcement Learning for Multi-Turn Image Editing with Flow Matching 17

work page internal anchor Pith review Pith/arXiv arXiv 2022

[31] [31]

In: The Fourteenth International Conference on Learning Representations (2026), https://openreview.net/forum?id=E7YpL4L4Xh3, 5, 7

Luo, X., Wang, J., Wu, C., Xiao, S., Jiang, X., Lian, D., Zhang, J., Liu, D., Liu, Z.: Editscore: Unlocking online RL for image editing via high-fidelity reward modeling. In: The Fourteenth International Conference on Learning Representations (2026), https://openreview.net/forum?id=E7YpL4L4Xh3, 5, 7

2026

[32] [32]

Ad- vances in Neural Information Processing Systems37, 41494–41516 (2024) 4

Ma, Y., Ji, J., Ye, K., Lin, W., Wang, Z., Zheng, Y., Zhou, Q., Sun, X., Ji, R.: I2ebench: A comprehensive benchmark for instruction-based image editing. Ad- vances in Neural Information Processing Systems37, 41494–41516 (2024) 4

2024

[33] [33]

Ma, Y., Wu, X., Sun, K., Li, H.: Hpsv3: Towards wide-spectrum human preference score (2025),https://arxiv.org/abs/2508.037894

work page arXiv 2025

[34] [34]

com / index / introducing-4o-image-generation/(2025), accessed: 2025-06-22 2, 4

OpenAI: Introducing 4o image generation.https : / / openai . com / index / introducing-4o-image-generation/(2025), accessed: 2025-06-22 2, 4

2025

[35] [35]

Lvlm- count: Enhancing the counting ability of large vision-language models, 2026

Qharabagh, M.F., Ghofrani, M., Fountoulakis, K.: Lvlm-count: Enhancing the counting ability of large vision-language models. arXiv preprint arXiv:2412.00686 (2024) 4

work page arXiv 2024

[36] [36]

Qian, Y., Bocek-Rivele, E., Song, L., Tong, J., Yang, Y., Lu, J., Hu, W., Gan, Z.: Pico-banana-400k: A large-scale dataset for text-guided image editing (2025), https://arxiv.org/abs/2510.1980810

work page arXiv 2025

[37] [37]

arXiv preprint arXiv:2505.11493 , year=

Qian, Y., Lu, J., Fu, T.J., Wang, X., Chen, C., Yang, Y., Hu, W., Gan, Z.: Gie- bench: Towards grounded evaluation for text-guided image editing. arXiv preprint arXiv:2505.11493 (2025) 8

work page arXiv 2025

[38] [38]

In: SC20: international conference for high performance computing, networking, storage and analysis

Rajbhandari, S., Rasley, J., Ruwase, O., He, Y.: Zero: Memory optimizations to- ward training trillion parameter models. In: SC20: international conference for high performance computing, networking, storage and analysis. pp. 1–16. IEEE (2020) 11

2020

[39] [39]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022) 2, 4

2022

[40] [40]

Generalization in generation: A closer look at exposure bias.arXiv preprint arXiv:1910.00292, 2019

Schmidt, F.: Generalization in generation: A closer look at exposure bias. arXiv preprint arXiv:1910.00292 (2019) 2

work page arXiv 1910

[41] [41]

Proximal Policy Optimization Algorithms

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) 4

work page internal anchor Pith review Pith/arXiv arXiv 2017

[42] [42]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Sheynin, S., Polyak, A., Singer, U., Kirstain, Y., Zohar, A., Ashual, O., Parikh, D., Taigman, Y.: Emu edit: Precise image editing via recognition and generation tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8871–8879 (2024) 7

2024

[43] [43]

DINOv3

Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khali- dov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al.: Dinov3. arXiv preprint arXiv:2508.10104 (2025) 5

work page internal anchor Pith review Pith/arXiv arXiv 2025

[44] [44]

Advances in neural information processing systems34, 1415–1428 (2021) 4

Song, Y., Durkan, C., Murray, I., Ermon, S.: Maximum likelihood training of score- based diffusion models. Advances in neural information processing systems34, 1415–1428 (2021) 4

2021

[45] [45]

Score-Based Generative Modeling through Stochastic Differential Equations

Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score- based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020) 4, 6

work page internal anchor Pith review Pith/arXiv arXiv 2011

[46] [46]

Journal of Cognitive Neuroscience11(1), 126–134 (1999) 2, 4, 5

Sutton, R.S., Barto, A.G., et al.: Reinforcement learning. Journal of Cognitive Neuroscience11(1), 126–134 (1999) 2, 4, 5

1999

[47] [47]

Vision Language Models are Biased

Vo, A., Nguyen, K.N., Taesiri, M.R., Dang, V.T., Nguyen, A.T., Kim, D.: Vision language models are biased. arXiv preprint arXiv:2505.23941 (2025) 4 18 J. Huang et al

work page internal anchor Pith review Pith/arXiv arXiv 2025

[48] [48]

Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., ming Yin, S., Bai, S., Xu, X., Chen, Y., Chen, Y., Tang, Z., Zhang, Z., Wang, Z., Yang, A., Yu, B., Cheng, C., Liu, D., Li, D., Zhang, H., Meng, H., Wei, H., Ni, J., Chen, K., Cao, K., Peng, L., Qu, L., Wu, M., Wang, P., Yu, S., Wen, T., Feng, W., Xu, X., Wang, Y., Zhang, Y., Zhu, Y., Wu, Y., Cai, Y., L...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[49] [49]

OmniGen2: Towards Instruction-Aligned Multimodal Generation

Wu, C., Zheng, P., Yan, R., Xiao, S., Luo, X., Wang, Y., Li, W., Jiang, X., Liu, Y., Zhou, J., Liu, Z., Xia, Z., Li, C., Deng, H., Wang, J., Luo, K., Zhang, B., Lian, D., Wang, X., Wang, Z., Huang, T., Liu, Z.: Omnigen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871 (2025) 2, 4

work page internal anchor Pith review Pith/arXiv arXiv 2025

[50] [50]

In: The Fourteenth In- ternational Conference on Learning Representations (2026),https://openreview

Wu, K., Jiang, S., Ku, M., Nie, P., Liu, M., Chen, W.: Editreward: A human- aligned reward model for instruction-guided image editing. In: The Fourteenth In- ternational Conference on Learning Representations (2026),https://openreview. net/forum?id=eZu358JOOR3, 7

2026

[51] [51]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Xiao, S., Wang, Y., Zhou, J., Yuan, H., Xing, X., Yan, R., Li, C., Wang, S., Huang, T., Liu, Z.: Omnigen: Unified image generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 13294–13304 (2025) 2, 4

2025

[52] [52]

In: Proceedingsofthe37thInternationalConferenceonNeuralInformationProcessing Systems

Xu, J., Liu, X., Wu, Y., Tong, Y., Li, Q., Ding, M., Tang, J., Dong, Y.: Imagere- ward: learning and evaluating human preferences for text-to-image generation. In: Proceedingsofthe37thInternationalConferenceonNeuralInformationProcessing Systems. pp. 15903–15935 (2023) 4

2023

[53] [53]

DanceGRPO: Unleashing GRPO on Visual Generation

Xue, Z., Wu, J., Gao, Y., Kong, F., Zhu, L., Chen, M., Liu, Z., Liu, W., Guo, Q., Huang, W., et al.: Dancegrpo: Unleashing grpo on visual generation. arXiv preprint arXiv:2505.07818 (2025) 2, 4

work page internal anchor Pith review Pith/arXiv arXiv 2025

[54] [54]

Qwen3 Technical Report

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025) 3

work page internal anchor Pith review Pith/arXiv arXiv 2025

[55] [55]

arXiv preprint arXiv:2504.13143 , year=

Yang, S., Hui, M., Zhao, B., Zhou, Y., Ruiz, N., Xie, C.: Complex-edit: Cot-like instruction generation for complexity-controllable image editing benchmark. arXiv preprint arXiv:2504.13143 (2025) 2, 4

work page arXiv 2025

[56] [56]

In: The Thirty-ninth Annual Confer- ence on Neural Information Processing Systems Datasets and Benchmarks Track (2025),https://openreview.net/forum?id=uUCSrMlfD32, 3, 4, 7, 11

Ye, Y., He, X., Li, Z., Lin, B., Yuan, S., Yan, Z., Hou, B., Yuan, L.: Imgedit: A unified image editing dataset and benchmark. In: The Thirty-ninth Annual Confer- ence on Neural Information Processing Systems Datasets and Benchmarks Track (2025),https://openreview.net/forum?id=uUCSrMlfD32, 3, 4, 7, 11

2025

[57] [57]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Yu, Q., Chow, W., Yue, Z., Pan, K., Wu, Y., Wan, X., Li, J., Tang, S., Zhang, H., Zhuang, Y.: Anyedit: Mastering unified high-quality image editing for any idea. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 26125–26135 (2025) 4, 7

2025

[58] [58]

Advances in Neural Information Pro- cessing Systems36, 31428–31449 (2023) 4, 7

Zhang, K., Mo, L., Chen, W., Sun, H., Su, Y.: Magicbrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Pro- cessing Systems36, 31428–31449 (2023) 4, 7

2023

[59] [59]

In: Leonardis, A., Ricci, E., Roth, S., Rus- sakovsky, O., Sattler, T., Varol, G

Zhang,Y.,Yu,P.,Wu,Y.N.:Object-conditionedenergy-basedattentionmapalign- ment in text-to-image diffusion models. In: Leonardis, A., Ricci, E., Roth, S., Rus- sakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. pp. 55–71. Springer Nature Switzerland, Cham (2025) 4

2024

[60] [60]

Zhang, Y., Yu, P., Zhu, Y., Chang, Y., Gao, F., Wu, Y.N., Leong, O.: Flow pri- ors for linear inverse problems via iterative corrupted trajectory matching. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024),https://openreview.net/forum?id=1H2e7USI094 Reinforcement Learning for Multi-Turn Image Editing with Flow Matching 19

2024

[61] [61]

In: The Thirteenth International Conference on Learning Rep- resentations (2025),https://openreview.net/forum?id=84pDoCD4lH4

Zhang, Z., Hu, F., Lee, J., Shi, F., Kordjamshidi, P., Chai, J., Ma, Z.: Do vision- language models represent space and how? evaluating spatial frame of reference under ambiguities. In: The Thirteenth International Conference on Learning Rep- resentations (2025),https://openreview.net/forum?id=84pDoCD4lH4

2025

[62] [62]

Advances in Neural Information Processing Systems37, 3058–3093 (2024) 4, 7

Zhao, H., Ma, X.S., Chen, L., Si, S., Wu, R., An, K., Yu, P., Zhang, M., Li, Q., Chang, B.: Ultraedit: Instruction-based fine-grained image editing at scale. Advances in Neural Information Processing Systems37, 3058–3093 (2024) 4, 7

2024

[63] [63]

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

Zhao,Y.,Gu,A.,Varma,R.,Luo,L.,Huang,C.C.,Xu,M.,Wright,L.,Shojanazeri, H., Ott, M., Shleifer, S., et al.: Pytorch fsdp: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277 (2023) 11

work page internal anchor Pith review Pith/arXiv arXiv 2023

[64] [64]

In: The Fourteenth International Conference on Learning Representations (2026),https://openreview.net/forum?id=VJZ477R89F2, 5, 6, 11 20 J

Zheng, K., Chen, H., Ye, H., Wang, H., Zhang, Q., Jiang, K., Su, H., Ermon, S., Zhu, J., Liu, M.Y.: DiffusionNFT: Online diffusion reinforcement with forward process. In: The Fourteenth International Conference on Learning Representations (2026),https://openreview.net/forum?id=VJZ477R89F2, 5, 6, 11 20 J. Huang et al. A Overview Fig.8: Overview of MT-EditF...

2026

[65] [69]

Then give final score 1-5 in <Score> <Thought> [Analysis here] </Thought> <Score>X</Score> Multi-turn IF prompt

Fully Compliant: Instruction accurately and completely executed, all non-targeted original content perfectly preserved Provide analysis in <Thought> tag covering: instruction execution accuracy and completeness, preservation of non-targeted elements, unintended changes. Then give final score 1-5 in <Score> <Thought> [Analysis here] </Thought> <Score>X</Sc...

[66] [70]

Failed: Instruction completely ignored/opposite changes made, or critical original content destroyed

[67] [71]

Minimal: Only minor parts of instruction followed, major elements missing/wrong, or severe content loss/unintended changes

[68] [72]

Partial: Key instruction elements followed but incomplete/ inaccurate, with noticeable original content loss/unintended modifications

[69] [73]

Mostly Compliant: Instruction largely executed correctly with minor flaws, original content well-preserved with minimal unintended changes

[70] [74]

Fully Compliant: Instruction accurately and completely executed, all non-targeted original content perfectly preserved Provide analysis in <Thought> tag covering each turn’s execution. Then give scores in format: <Thought> [Turn 1 analysis] [Turn 2 analysis] [Turn 3 analysis] </Thought> <Turn1Score>X</Turn1Score> <Turn2Score>X</Turn2Score> <Turn3Score>X</...