Edit-R2: Context-Aware Reinforcement Learning for Multi-Turn Image Editing

Fangyuan Kong; Haoran He; Kun Gai; Ling Pan; Pengfei Wan; Xintao Wang; Yuxiao Ye

arxiv: 2606.05950 · v2 · pith:CTZ3DTJSnew · submitted 2026-06-04 · 💻 cs.AI

Yuxiao Ye, Haoran He, Fangyuan Kong, Xintao Wang, Pengfei Wan, Kun Gai, Ling Pan

Pith reviewed 2026-06-29 05:08 UTC · model grok-4.3

classification 💻 cs.AI

keywords multi-turn image editing · reinforcement learning · multimodal models · intent reconstruction · context awareness · diffusion models · image generation

The pith

Edit-R2 reconstructs session intent to enable reliable multi-turn image editing with reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Edit-R2 as a post-training method that applies reinforcement learning to multimodal models for iterative image editing. It tackles challenges where long histories of instructions and images cause dilution of key constraints and where past mistakes affect new outputs. The core technique reconstructs an overall session intent into a clear reasoning step before generating each edit. A unified training objective optimizes both the text reasoning and the image generation, with filtering to remove bad training examples. This results in stronger performance on following instructions, keeping content consistent, and maintaining awareness of all prior constraints across multiple turns.

Core claim

Edit-R2 reconstructs the operative session intent to consolidate scattered historical constraints into an explicit reasoning trace before each editing turn, and applies multi-turn RL over both reasoning and generation through a unified objective that jointly optimizes intent reconstruction in text space and flow-matching image generation in latent space, while using trajectory filtering to suppress corrupted rollouts.

What carries the argument

Operative session intent reconstruction that turns scattered constraints into an explicit reasoning trace, paired with a unified RL objective across discrete text and continuous image spaces plus trajectory filtering.
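
As a rough, prompt-only illustration of consolidating scattered constraints into one explicit instruction, the sketch below assembles a rewriting prompt from the session history. It is modeled on the QCoT prompt fragments quoted further down this page, not on the learned IC-CoT that Edit-R2 trains with RL. The function name `build_intent_prompt`, the "Round i:" layout, and the abridged second rule are assumptions made for illustration.

```python
from typing import List

# Rule text taken from the QCoT prompt fragments quoted on this page
# (the second rule is abridged; the layout around the rules is hypothetical).
RULES = [
    "Replace any vague pronouns (it, them, they, its, the one, etc.) "
    "with the specific object names visible in the image",
    "If any earlier round specifies a persistent global constraint, "
    "explicitly incorporate that constraint into the current instruction.",
]


def build_intent_prompt(history: List[str], current: str) -> str:
    """Build a text prompt asking a model to rewrite `current` as a
    self-contained instruction that carries all session constraints."""
    lines = ["Earlier instructions:"]
    lines += [f"Round {i}: {text}" for i, text in enumerate(history, start=1)]
    lines.append(f"Current instruction: {current}")
    lines += [f"{n}. {rule}" for n, rule in enumerate(RULES, start=1)]
    lines.append("Think briefly, under 100 words. Output ONLY the enhanced instruction.")
    return "\n".join(lines)
```

Calling `build_intent_prompt(["Ensure all subsequent edits are yellow"], "Change its color")` yields a prompt in which the global color constraint and the ambiguous pronoun are both visible to the rewriting model in a single short context.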

If this is right

Multi-turn editing models follow new instructions while preserving all accumulated session constraints.
Earlier editing mistakes degrade later turns less, because state contamination is reduced.
The method yields competitive results against strong baselines on instruction following, content consistency, and global awareness.
Systematic evaluation becomes possible through the introduced MICE-Bench benchmark with automated metrics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar intent reconstruction could apply to other sequential tasks involving history like conversational agents or iterative design tools.
Future work might test if this approach scales to longer sessions or different modalities such as video.
The trajectory filtering mechanism may generalize to other RL setups prone to error accumulation.

Load-bearing premise

Reconstructing the operative session intent consolidates scattered historical constraints into an explicit reasoning trace that prevents long-context dilution and state contamination.

What would settle it

If models trained with Edit-R2 show no gains in multi-turn performance metrics over standard fine-tuning on the MICE-Bench dataset, the central claim would not hold.

Figures

Figures reproduced from arXiv: 2606.05950 by Fangyuan Kong, Haoran He, Kun Gai, Ling Pan, Pengfei Wan, Xintao Wang, Yuxiao Ye.

**Figure 1.** Overview of in-context editing. (a)-(b): …

**Figure 2.** (a): Long-context issue for in-context editing. (b): Attention magnitude of the flow …

**Figure 3.** Overview of MICE-Bench. (a)-(b): Data construction pipeline. (c): Task category …

**Figure 4.** Overview of Edit-R2.

… reward as follows:

$$r_t = w_{\mathrm{IF}} \cdot \mathrm{IF}(Y_t, T_t, Y_{t+1}) + w_{\mathrm{CC}} \cdot \mathrm{CC}(Y_0, Y_{t+1}) + w_{\mathrm{GA}} \cdot \mathrm{GA}(s_t, Y_{t+1}), \tag{1}$$

where $\mathrm{IF} \in \{0, 1\}$, $\mathrm{CC} \in [0, 1]$, and $\mathrm{GA} \in \{0, 1\}$ denote instruction-following, content-consistency, and global-awareness metrics, which we directly adopt as reward components for RL training. The objective is to find $\theta$ maximizing the expected cumulative reward, $\max_{\theta} \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{K-1} r_t\right]$. 4.2 E…

**Figure 5.** (a) Example of emergent IC-CoT behavior. After Edit-R2 training, IC-CoT produces more …

**Figure 6.** Attention visualization comparing BAGEL and Edit-R2. BAGEL gets lost in the long inter…

**Figure 7.** Qualitative comparison on Content Memory task. Erroneous areas are in red boxes.

Attention analysis: Edit-R2 recovers session intent lost by BAGEL. As shown in …

**Figure 8.** Qualitative comparison on Content Understanding task. Erroneous areas are in red boxes.

… propose Edit-R2, comprising IC-CoT for session intent reconstruction and unified multi-turn RL that jointly optimizes reasoning and image generation, alongside MICE-Bench for systematic evaluation. Edit-R2 yields substantial gains and achieves competitive performance against strong models. Limitations. MICE-Bench covers…

**Figure 9.** Training dynamics. Average rewards of IF, GA and CC are reported at each turn. Compared …

**Figure 10.** Ablations on reward weights. Incorporating CC in reward significantly mitigating the …

**Figure 11.** Ablations on self-ensembled reward model IF.

**Figure 12.** Additional qualitative comparison on the …

**Figure 13.** Additional qualitative comparison on the …

**Figure 14.** Illustration of QCoT.

**Figure 15.** Prompt template used for VLM-Based GA metric on …

**Figure 16.** Prompt template used for VLM-Based GA metric on …

**Figure 17.** Example of GA output for failed editing results on content memory task.

**Figure 18.** Example of GA output for successful editing results on content memory task. The outputs …

**Figure 19.** Example of GA output for failed editing results on content understanding task.

**Figure 20.** Example of GA output for successful editing results on content understanding task.
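
The text captured alongside the Figure 4 caption above defines a per-turn reward as a weighted sum of the IF, CC and GA metrics, summed over the K turns of a session. A minimal sketch of that arithmetic follows. The weight values are placeholders (the excerpt does not state the ones used in the paper), and the names `RewardWeights`, `turn_reward` and `session_return` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class RewardWeights:
    # Placeholder weights; the actual values are not given in the excerpt.
    w_if: float = 1.0
    w_cc: float = 1.0
    w_ga: float = 1.0


def turn_reward(if_score: int, cc_score: float, ga_score: int,
                w: RewardWeights = RewardWeights()) -> float:
    """r_t = w_IF * IF + w_CC * CC + w_GA * GA for a single editing turn."""
    if if_score not in (0, 1) or ga_score not in (0, 1):
        raise ValueError("IF and GA are binary metrics")
    if not 0.0 <= cc_score <= 1.0:
        raise ValueError("CC must lie in [0, 1]")
    return w.w_if * if_score + w.w_cc * cc_score + w.w_ga * ga_score


def session_return(turn_scores: Sequence[Tuple[int, float, int]],
                   w: RewardWeights = RewardWeights()) -> float:
    """Sum of r_t over turns t = 0..K-1, the quantity whose expectation is maximized."""
    return sum(turn_reward(i, c, g, w) for i, c, g in turn_scores)
```

The scores themselves would come from the benchmark's automated evaluators; here they are passed in as plain numbers so the weighting can be inspected in isolation.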

Abstract

Text-guided image editing has advanced rapidly with diffusion models and unified multimodal foundation models. However, most existing methods remain confined to single-turn settings, overlooking the more realistic scenario of multi-turn in-context editing, where users iteratively refine an image through a sequence of instructions. In this setting, a model must follow each new instruction while preserving accumulated session-level constraints, challenged by two coupled failure modes: long-context dilution, where sparse textual constraints become difficult to recover from growing interleaved image-text histories, and state contamination, where earlier editing mistakes degrade subsequent generations. We introduce Edit-R2, a novel reinforcement learning post-training framework for unified multimodal models. Edit-R2 reconstructs the operative session intent, which effectively consolidates scattered historical constraints into an explicit reasoning trace before each editing turn. It further enables multi-turn RL over both reasoning and generation through a unified objective that jointly optimizes intent reconstruction generation in discrete text space and flow-matching image generation in continuous latent space, while a trajectory filtering mechanism suppresses corrupted rollouts to stabilize training under state contamination. To support systematic evaluation, we introduce MICE-Bench, a large-scale benchmark for multi-turn in-context editing with automated metrics for instruction following (IF), content consistency (CC), and global awareness (GA) over accumulated session constraints. Experiments show that Edit-R2 substantially improves multi-turn in-context editing and achieves competitive performance compared against strong baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Edit-R2 adds intent reconstruction and a unified RL objective with trajectory filtering to multi-turn image editing, plus a new benchmark, but the gains rest on unshown numbers.

The paper's core move is to treat multi-turn editing as a session-level problem rather than a sequence of independent single-turn edits. It reconstructs an explicit session intent from the history before each turn, then runs RL that optimizes both the text reasoning and the flow-matching image steps under one objective, with a filter to drop corrupted trajectories.
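
The page does not say how the trajectory filter decides that a rollout is corrupted. Purely as an illustration of the idea, the sketch below assumes one possible rule: cut each rollout just before its first turn whose reward falls under a threshold, so that later turns conditioned on a bad edit do not contribute to training. The rule, the threshold value and the name `filter_rollouts` are all assumptions, not the paper's mechanism.

```python
from typing import List, Sequence


def filter_rollouts(rollouts: Sequence[Sequence[float]],
                    min_turn_reward: float = 0.5) -> List[List[float]]:
    """Keep the clean prefix of each rollout (per-turn rewards) and drop
    rollouts whose very first turn already falls below the threshold.

    Hypothetical rule: a turn with reward < min_turn_reward is treated as
    contaminating every turn that follows it.
    """
    kept: List[List[float]] = []
    for rewards in rollouts:
        prefix: List[float] = []
        for r in rewards:
            if r < min_turn_reward:
                break  # everything after this turn builds on a bad state
            prefix.append(r)
        if prefix:
            kept.append(prefix)
    return kept
```

A filter of this kind trades data volume for cleaner credit assignment, which is also why the letter's concern about it removing the hard cases is worth checking against the paper's ablations.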

What stands out is the benchmark, MICE-Bench, which scores instruction following, content consistency, and global awareness over accumulated constraints. That setup is a concrete step beyond the single-turn evaluations that dominate the cited prior work. The failure modes they name—long-context dilution and state contamination—are stated plainly and the proposed fixes line up with them.

The abstract claims substantial gains and competitive results against strong baselines, yet supplies none of the numbers, ablations, or variance estimates. Without those, it is impossible to tell whether the unified objective actually moves the needle or whether the filtering simply removes the hard cases. The method description is coherent, but the empirical support is still thin.

This is for groups building applied editing tools that need to handle iterative user sessions. Readers working on RL post-training for multimodal models or on multi-turn benchmarks will find the setup useful to examine. The work is coherent enough on its own terms to merit referee time, though any review should focus first on the missing quantitative details and the exact implementation of the trajectory filter.

Referee Report

2 major / 2 minor

Summary. The paper introduces Edit-R2, a reinforcement learning post-training framework for multi-turn in-context image editing with unified multimodal models. It reconstructs operative session intent to consolidate historical constraints into an explicit reasoning trace, applies a unified RL objective jointly optimizing discrete-text intent reconstruction and continuous-latent flow-matching image generation, incorporates trajectory filtering to suppress corrupted rollouts, and introduces the MICE-Bench benchmark with automated metrics for instruction following (IF), content consistency (CC), and global awareness (GA). Experiments are reported to show substantial improvements and competitive performance against strong baselines.

Significance. If the empirical claims hold, the work addresses a realistic and underexplored setting in text-guided image editing by mitigating long-context dilution and state contamination through intent reconstruction and filtering. The introduction of MICE-Bench provides a standardized evaluation resource, and the unified cross-modal RL objective represents a technically interesting direction for multimodal post-training. These elements could influence subsequent research on iterative multimodal agents if the gains are reproducible and the method generalizes.

major comments (2)

[Abstract] The central claim that Edit-R2 'substantially improves multi-turn in-context editing' is stated without any numerical results, error bars, ablation tables, or statistical tests. This absence prevents verification of the magnitude or reliability of the reported gains and is load-bearing for assessing whether the unified RL objective and trajectory filtering deliver the claimed benefits.
[Method] Unified objective: It is unclear from the high-level description whether the joint optimization over discrete text space and continuous latent space uses separate reward models or a single shared signal, and whether any component of the objective is fitted on data that overlaps with the MICE-Bench evaluation set. This directly affects the circularity concern for the performance claims.

minor comments (2)

[Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., average IF or GA improvement) to support the 'substantially improves' statement.
[Method] Notation for the unified RL objective and the trajectory filtering threshold should be defined explicitly with equations if they appear in the full text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly to improve clarity and substantiation of claims.

Referee: [Abstract] The central claim that Edit-R2 'substantially improves multi-turn in-context editing' is stated without any numerical results, error bars, ablation tables, or statistical tests. This absence prevents verification of the magnitude or reliability of the reported gains and is load-bearing for assessing whether the unified RL objective and trajectory filtering deliver the claimed benefits.

Authors: We agree that the abstract would be strengthened by including representative quantitative results. In the revised version we will add concise numerical highlights of the main gains (e.g., relative improvements on IF, CC, and GA) together with pointers to the corresponding tables and figures. Full error bars, ablation tables, and statistical tests remain in the experiments section; space constraints preclude their inclusion in the abstract itself.

Revision: yes

Referee: [Method] Unified objective: It is unclear from the high-level description whether the joint optimization over discrete text space and continuous latent space uses separate reward models or a single shared signal, and whether any component of the objective is fitted on data that overlaps with the MICE-Bench evaluation set. This directly affects the circularity concern for the performance claims.

Authors: We will revise the method section to state explicitly that a single shared multimodal reward signal is used for both the discrete intent-reconstruction and continuous flow-matching objectives. We will also add a clear statement that the RL training trajectories are drawn from a held-out collection distinct from the MICE-Bench evaluation set, thereby removing any circularity concern. These additions address the high-level description without altering the underlying algorithm.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper describes a post-training RL framework (intent reconstruction + unified objective over text reasoning and flow-matching generation + trajectory filtering) evaluated on a newly introduced benchmark MICE-Bench. No equations, derivations, or self-citations are shown that reduce any claimed prediction or result to a fitted input or prior self-result by construction. The central claims rest on experimental comparisons to external baselines rather than internal redefinitions or tautological steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.1-grok · 5794 in / 947 out tokens · 26759 ms · 2026-06-29T05:08:03.743838+00:00 · methodology

Reference graph

Works this paper leans on

25 extracted references

[1]

it,” “them,

Coreference resolution: replace vague pronouns (“it,” “them,” “its,” etc.) with the specific object names visible in the current image Y_t
[2]

BAGEL w/ think

Global constraint injection: if any earlier turn specifies a persistent constraint (e.g., a color or material requirement for all subsequent edits), explicitly incorporate that constraint into the current instruction. Examples of QCoT are shown in Figure 14. Note that these questions are tailored to MICE-Bench, which differ from BAGEL’s original thinking ...

2026
[3]

Replace any vague pronouns (it, them, they, its, the one, etc.) with the specific object names visible in the image
[4]

Think briefly, under 100 words

If any earlier round specifies a persistent global constraint (e.g., 'Ensure all subsequent edits are yellow', 'Use leather material for all edits'), explicitly incorporate that constraint into the current instruction. Think briefly, under 100 words. Output ONLY the enhanced instruction. No explanation, no extra text. Figure 14: Illustration of QCoT. 29 Y...
[5]

Instruction Adherence: Image i+1 successfully reflects the modifications specified by Instruction i
[6]

objects added or modified in subsequent edits must be yellow

Global Constraint Compliance: If the first instruction establishes a global constraint that governs the entire editing session (e.g., "objects added or modified in subsequent edits must be yellow"), then Image i+1 must adhere to this constraint. The specific rules are as follows: - Scope of Application: Global constraint compliance is required only with r...
[7]

it", "them

Pronoun Resolution (Content Understanding): If Instruction i contains pronouns (e.g., "it", "them", "there", "its"), the object edited in Image i+1 must correspond to the object annotated in square brackets in Formatted Instruction i. That is, the model must correctly identify which specific object the pronoun refers to and perform the edit on that object...
[8]

The pronoun 'it' should refer to [red car], but the model incorrectly removed the blue truck

Instruction Following: Image i+1 successfully reflects the modifications specified by Instruction i — including the correct edit type (add / remove / replace / recolor, etc.) and the correct attribute value (e.g., the color is indeed changed to the specified color). Execution and Output Logic: Evaluate each turn sequentially (Turn 1, Turn 2, ...). - If Tu...
[9]

- Wool grey scarf and plastic black glasses remain unchanged, meeting the instruction

**Instruction Following:** - Background successfully changed to a red curtain, meeting the instruction. - Wool grey scarf and plastic black glasses remain unchanged, meeting the instruction
[10]

Here, the background was modified to a red curtain, which complies with the constraint

**Global Constraint Following:** - The global constraint specifies that red color can only be applied to directly added or modified objects. Here, the background was modified to a red curtain, which complies with the constraint. **Conclusion:** <answer_turn_1> yes </answer_turn_1> <reason_turn_1> Background successfully changed to a red curtain, wool grey...
[11]

**Instruction Following:** - The wool grey scarf has been successfully removed, meeting the instruction
[12]

**Global Constraint Following:** - The instruction is to remove an object, which does not involve adding or modifying objects; thus, the global constraint is irrelevant. **Conclusion:** <answer_turn_2> yes </answer_turn_2> <reason_turn_2> Wool grey scarf successfully removed, meeting the instruction; removal operation does not require considering the glob...
[13]

change the material to metal

**Instruction Following:** - The material of the glasses has been successfully changed from plastic to metal, meeting the instruction requirement of "change the material to metal."
[14]

directly modified object

**Global Constraint Following:** - The global constraint states that in all subsequent edits, red color can only be applied to directly added or modified objects. The object modified in this round is the glasses, whose material was changed to metal, making it a "directly modified object." According to the global constraint, this modified object should be ...
[15]

**Instruction Following:** - The material of the plastic black glasses has been successfully changed to red metal, meeting the instruction
[16]

Here, the glasses' material was changed to metal, and red color is applied only to the glasses themselves, complying with the constraint

**Global Constraint Following:** - The global constraint states that red color can only be applied to directly added or modified objects. Here, the glasses' material was changed to metal, and red color is applied only to the glasses themselves, complying with the constraint. **Conclusion:** <answer_turn_3> yes </answer_turn_3> <reason_turn_3> The material...
[18]

In Image 2, an orange flower has indeed been added to the left of the black plastic handle, meeting the instruction

**Instruction Following:** - The instruction requires adding a flower to the left of the black plastic handle. In Image 2, an orange flower has indeed been added to the left of the black plastic handle, meeting the instruction. **Conclusion:** <answer_turn_1> yes </answer_turn_1> <reason_turn_1> The instruction requires adding a flower to the left of the ...
[19]

According to the explicit reference instruction, the model needs to change the color of the flower

**Pronoun Resolution (Content Understanding):** - The pronoun "its" in the instruction refers to the object added in the previous round, i.e., the flower. According to the explicit reference instruction, the model needs to change the color of the flower. However, the model did not correctly understand that "its" refers to the newly added flower in the fir...
[20]

Therefore, the instruction is not correctly followed

**Instruction Following:** - Although the model changed a flower to orange, it operated on the wrong object (the original flower rather than the newly added one). Therefore, the instruction is not correctly followed. **Conclusion:** <answer_turn_2> no </answer_turn_2> <reason_turn_2> The model failed to correctly understand that "its" refers to the flower...
[21]

**Pronoun Resolution (Content Understanding):** - The instruction in this turn contains no pronouns, so pronoun resolution is not required
[22]

Observing Image 2, a yellow flower has indeed been added to the left of the plastic black handle

**Instruction Following:** - The instruction requires adding a flower to the left of the plastic black handle. Observing Image 2, a yellow flower has indeed been added to the left of the plastic black handle. **Conclusion:** <answer_turn_1> yes </answer_turn_1> <reason_turn_1> The instruction requires adding a flower to the left of the plastic black handl...
[23]

its" in this turn refers to the flower added in the previous round. The explicit reference instruction clearly refers to

**Pronoun Resolution (Content Understanding):** - The pronoun "its" in this turn refers to the flower added in the previous round. The explicit reference instruction clearly refers to "[flower]", i.e., the yellow flower added in the previous round. The model indeed changes the color of the flower to orange in Image 3, so pronoun resolution is correct
[24]

Observing Image 3, the flower's color has been successfully changed to orange

**Instruction Following:** - The instruction requires changing the color of the flower to orange. Observing Image 3, the flower's color has been successfully changed to orange. **Conclusion:** <answer_turn_2> yes </answer_turn_2> <reason_turn_2> The instruction requires changing the color of the flower to orange. Image 3 successfully changes the flower's ...
[25]

it" in this turn refers to the orange flower edited in the previous round. The explicit reference instruction clearly refers to

**Pronoun Resolution (Content Understanding):** - The pronoun "it" in this turn refers to the orange flower edited in the previous round. The explicit reference instruction clearly refers to "[orange flower]", i.e., the flower whose color was changed to orange in the previous round. The model indeed removes the orange flower in Image 4, so pronoun resolut...
[26]

Observing Image 4, the orange flower has been removed, meeting the instruction

**Instruction Following:** - The instruction requires removing the orange flower. Observing Image 4, the orange flower has been removed, meeting the instruction. **Conclusion:** <answer_turn_3> yes </answer_turn_3> <reason_turn_3> The instruction requires removing the orange flower. Image 4 successfully removes the orange flower, meeting the instruction. ...
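
The evaluator outputs quoted in the fragments above wrap each per-turn verdict in `<answer_turn_k> yes|no </answer_turn_k>` tags. A small parser for that tag format might look like the sketch below. How the benchmark itself extracts verdicts is not shown on this page, so the function name `parse_turn_verdicts` and the choice to ignore malformed tags silently are assumptions.

```python
import re
from typing import Dict

# Matches e.g. "<answer_turn_2> no </answer_turn_2>"; the backreference \1
# requires the closing tag to carry the same turn index as the opening tag.
_ANSWER = re.compile(
    r"<answer_turn_(\d+)>\s*(yes|no)\s*</answer_turn_\1>", re.IGNORECASE
)


def parse_turn_verdicts(text: str) -> Dict[int, bool]:
    """Map each turn index to True for 'yes' and False for 'no'.

    Tags that do not match the pattern are skipped rather than raising.
    """
    return {
        int(m.group(1)): m.group(2).lower() == "yes"
        for m in _ANSWER.finditer(text)
    }
```

Binary verdicts parsed this way line up with the binary IF and GA metrics described in the abstract, while the accompanying `<reason_turn_k>` text is left untouched for human inspection.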

[1] [1]

it,” “them,

Coreference resolution: replace vague pronouns (“it,” “them,” “its,” etc.) with the specific object names visible in the current imageY t

[2] [2]

BAGEL w/ think

Global constraint injection: if any earlier turn specifies a persistent constraint (e.g., a color or material requirement for all subsequent edits), explicitly incorporate that constraint into the current instruction. Examples of QCoT are shown in Figure 14. Note that these questions are tailored to MICE-Bench, which differ from BAGEL’s original thinking ...

2026

[3] [3]

Replace any vague pronouns (it, them, they, its, the one, etc.) with the specific object names visible in the image

[4] [4]

Think briefly, under 100 words

If any earlier round specifies a persistent global constraint (e.g., 'Ensure all subsequent edits are yellow', 'Use leather material for all edits'), explicitly incorporate that constraint into the current instruction. Think briefly, under 100 words. Output ONLY the enhanced instruction. No explanation, no extra text. Figure 14: Illustration ofQ CoT. 29 Y...

[5] [5]

Instruction Adherence: Image i+1 successfully reflects the modifications specified by Instruction i

[6] [6]

objects added or modified in subsequent edits must be yellow

Global Constraint Compliance: If the first instruction establishes a global constraint that governs the entire editing session (e.g., "objects added or modified in subsequent edits must be yellow"), then Image i+1 must adhere to this constraint. The specific rules are as follows: - Scope of Application: Global constraint compliance is required only with r...

[7] [7]

it", "them

Pronoun Resolution (Content Understanding): If Instruction i contains pronouns (e.g., "it", "them", "there", "its"), the object edited in Image i+1 must correspond to the object annotated in square brackets in Formatted Instruction i. That is, the model must correctly identify which specific object the pronoun refers to and perform the edit on that object...

[8] [8]

The pronoun 'it' should refer to [red car], but the model incorrectly removed the blue truck

Instruction Following: Image i+1 successfully reflects the modifications specified by Instruction i — including the correct edit type (add / remove / replace / recolor, etc.) and the correct attribute value (e.g., the color is indeed changed to the specified color). Execution and Output Logic: Evaluate each turn sequentially (Turn 1, Turn 2, ...). - If Tu...

[9] [9]

- Wool grey scarf and plastic black glasses remain unchanged, meeting the instruction

**Instruction Following:** - Background successfully changed to a red curtain, meeting the instruction. - Wool grey scarf and plastic black glasses remain unchanged, meeting the instruction

[10] [10]

Here, the background was modified to a red curtain, which complies with the constraint

**Global Constraint Following:** - The global constraint specifies that red color can only be applied to directly added or modified objects. Here, the background was modified to a red curtain, which complies with the constraint. **Conclusion:** <answer_turn_1> yes </answer_turn_1> <reason_turn_1> Background successfully changed to a red curtain, wool grey...

[11] [11]

**Instruction Following:** - The wool grey scarf has been successfully removed, meeting the instruction

[12] [12]

**Global Constraint Following:** - The instruction is to remove an object, which does not involve adding or modifying objects; thus, the global constraint is irrelevant. **Conclusion:** <answer_turn_2> yes </answer_turn_2> <reason_turn_2> Wool grey scarf successfully removed, meeting the instruction; removal operation does not require considering the glob...

[13] [13]

change the material to metal

**Instruction Following:** - The material of the glasses has been successfully changed from plastic to metal, meeting the instruction requirement of "change the material to metal."

[14] [14]

directly modified object

**Global Constraint Following:** - The global constraint states that in all subsequent edits, red color can only be applied to directly added or modified objects. The object modified in this round is the glasses, whose material was changed to metal, making it a "directly modified object." According to the global constraint, this modified object should be ...

[15] [15]

**Instruction Following:** - The material of the plastic black glasses has been successfully changed to red metal, meeting the instruction

[16] [16]

Here, the glasses' material was changed to metal, and red color is applied only to the glasses themselves, complying with the constraint

**Global Constraint Following:** - The global constraint states that red color can only be applied to directly added or modified objects. Here, the glasses' material was changed to metal, and red color is applied only to the glasses themselves, complying with the constraint. **Conclusion:** <answer_turn_3> yes </answer_turn_3> <reason_turn_3> The material...

[17] [18]

In Image 2, an orange flower has indeed been added to the left of the black plastic handle, meeting the instruction

**Instruction Following:** - The instruction requires adding a flower to the left of the black plastic handle. In Image 2, an orange flower has indeed been added to the left of the black plastic handle, meeting the instruction. **Conclusion:** <answer_turn_1> yes </answer_turn_1> <reason_turn_1> The instruction requires adding a flower to the left of the ...

[18] [19]

According to the explicit reference instruction, the model needs to change the color of the flower

**Pronoun Resolution (Content Understanding):** - The pronoun "its" in the instruction refers to the object added in the previous round, i.e., the flower. According to the explicit reference instruction, the model needs to change the color of the flower. However, the model did not correctly understand that "its" refers to the newly added flower in the fir...

[19] [20]

Therefore, the instruction is not correctly followed

**Instruction Following:** - Although the model changed a flower to orange, it operated on the wrong object (the original flower rather than the newly added one). Therefore, the instruction is not correctly followed. **Conclusion:** <answer_turn_2> no </answer_turn_2> <reason_turn_2> The model failed to correctly understand that "its" refers to the flower...

[20] [21]

**Pronoun Resolution (Content Understanding):** - The instruction in this turn contains no pronouns, so pronoun resolution is not required.

[21] [22]

Observing Image 2, a yellow flower has indeed been added to the left of the plastic black handle

**Instruction Following:** - The instruction requires adding a flower to the left of the plastic black handle. Observing Image 2, a yellow flower has indeed been added to the left of the plastic black handle. **Conclusion:** <answer_turn_1> yes </answer_turn_1> <reason_turn_1> The instruction requires adding a flower to the left of the plastic black handl...

[22] [23]

"its" in this turn refers to the flower added in the previous round. The explicit reference instruction clearly refers to

**Pronoun Resolution (Content Understanding):** - The pronoun "its" in this turn refers to the flower added in the previous round. The explicit reference instruction clearly refers to "[flower]", i.e., the yellow flower added in the previous round. The model indeed changes the color of the flower to orange in Image 3, so pronoun resolution is correct.

[23] [24]

Observing Image 3, the flower's color has been successfully changed to orange

**Instruction Following:** - The instruction requires changing the color of the flower to orange. Observing Image 3, the flower's color has been successfully changed to orange. **Conclusion:** <answer_turn_2> yes </answer_turn_2> <reason_turn_2> The instruction requires changing the color of the flower to orange. Image 3 successfully changes the flower's ...

[24] [25]

"it" in this turn refers to the orange flower edited in the previous round. The explicit reference instruction clearly refers to

**Pronoun Resolution (Content Understanding):** - The pronoun "it" in this turn refers to the orange flower edited in the previous round. The explicit reference instruction clearly refers to "[orange flower]", i.e., the flower whose color was changed to orange in the previous round. The model indeed removes the orange flower in Image 4, so pronoun resolut...

[25] [26]

Observing Image 4, the orange flower has been removed, meeting the instruction

**Instruction Following:** - The instruction requires removing the orange flower. Observing Image 4, the orange flower has been removed, meeting the instruction. **Conclusion:** <answer_turn_3> yes </answer_turn_3> <reason_turn_3> The instruction requires removing the orange flower. Image 4 successfully removes the orange flower, meeting the instruction. ...
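
The traces above report each turn's verdict inside `<answer_turn_k>` and `<reason_turn_k>` tags. As a minimal sketch of how such output could be turned into per-turn pass/fail signals, the following Python snippet extracts the verdicts with regular expressions. The function names `parse_verdicts` and `all_turns_passed` are hypothetical, the closing `</reason_turn_k>` tag is assumed (the excerpts are truncated before it appears), and the sample string is an illustrative paraphrase of the traces, not actual evaluator output. How the paper itself consumes these verdicts is not shown in this section.

```python
import re
from typing import Dict, Optional, TypedDict


class Verdict(TypedDict):
    passed: bool
    reason: Optional[str]


# Matches e.g. "<answer_turn_2> no </answer_turn_2>".
ANSWER_TAG = re.compile(
    r"<answer_turn_(\d+)>\s*(yes|no)\s*</answer_turn_\1>", re.IGNORECASE
)
# Assumption: each reason is closed by a matching </reason_turn_k> tag.
REASON_TAG = re.compile(
    r"<reason_turn_(\d+)>\s*(.*?)\s*</reason_turn_\1>", re.DOTALL
)


def parse_verdicts(text: str) -> Dict[int, Verdict]:
    """Map each turn index to its yes/no verdict and optional reason."""
    verdicts: Dict[int, Verdict] = {}
    for match in ANSWER_TAG.finditer(text):
        turn = int(match.group(1))
        verdicts[turn] = {
            "passed": match.group(2).lower() == "yes",
            "reason": None,
        }
    for match in REASON_TAG.finditer(text):
        turn = int(match.group(1))
        # Keep a reason only when the same turn also has a verdict.
        if turn in verdicts:
            verdicts[turn]["reason"] = match.group(2)
    return verdicts


def all_turns_passed(text: str) -> bool:
    """True only if at least one verdict exists and every verdict is 'yes'."""
    verdicts = parse_verdicts(text)
    return bool(verdicts) and all(v["passed"] for v in verdicts.values())


# Hypothetical sample modeled on the pronoun-resolution failure case.
sample = (
    "<answer_turn_1> yes </answer_turn_1> "
    "<reason_turn_1> Flower added to the left of the handle. </reason_turn_1> "
    "<answer_turn_2> no </answer_turn_2> "
    "<reason_turn_2> Edited the wrong flower. </reason_turn_2>"
)
result = parse_verdicts(sample)
```

Turns whose verdict tag is missing or malformed are simply absent from the result, so a caller can distinguish "failed" from "not judged" instead of silently treating an unparseable trace as a pass.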