pith. machine review for the scientific record.

arxiv: 2604.20258 · v1 · submitted 2026-04-22 · 💻 cs.CV

Recognition: unknown

Rethinking Where to Edit: Task-Aware Localization for Instruction-Based Image Editing

Chang Xu, Jingxuan He, Mengyu Zheng, Xiangyu Zeng, Xiyu Wang, Yunke Wang

Pith reviewed 2026-05-10 00:37 UTC · model grok-4.3

classification 💻 cs.CV
keywords instruction-based image editing · edit localization · task-aware masking · attention cues · feature centroids · non-edit consistency · diffusion transformers

The pith

Task-aware localization using attention cues from source and target images reduces unintended changes in instruction-based image editing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Instruction-based image editing often alters image areas that the text instruction does not mention, because models lack an explicit way to identify the right regions. Different operations such as adding an object, removing one, or replacing content create distinct spatial attention patterns. The paper shows that building masks from attention signals in both the original image and the target edited image, then choosing which signals to use according to the operation type, partitions the image into edit and non-edit zones more accurately. This keeps unrelated content stable while the edit still follows the instruction. Experiments confirm the approach raises consistency in non-edited regions without lowering how well the generated changes match the given text.
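To make the attention-signal step concrete, here is a minimal sketch of how an instruction-to-image attention map could be turned into a coarse edit cue. The aggregation scheme, the normalization, and the relative threshold are illustrative assumptions rather than the authors' exact procedure; the paper only reports that thresholds around 0.3 to 0.7 behave similarly.

```python
import numpy as np

def coarse_edit_cue(cross_attn: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Turn instruction-to-image attention into a coarse binary edit cue.

    cross_attn: (num_text_tokens, num_image_tokens) attention weights from the
        instruction tokens to one image stream's tokens (a hypothetical layout;
        the paper decomposes and propagates attention inside the DiT).
    threshold: relative cutoff; the paper reports similar IoU for 0.3-0.7.
    """
    # Total attention mass each image token receives from the instruction.
    per_token = cross_attn.sum(axis=0)
    # Normalize to [0, 1] so one relative threshold applies across images.
    per_token = (per_token - per_token.min()) / (per_token.max() - per_token.min() + 1e-8)
    # Tokens above the threshold form the coarse instruction-relevant region.
    return per_token >= threshold
```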

Core claim

The paper establishes that localization in instruction-based image editing must be made task-dependent rather than uniform, and it supplies a training-free method that extracts attention-based edit cues from the source and target image streams, forms feature centroids from those cues to separate tokens, and applies a unified mask construction rule that draws on the appropriate stream for each editing task.

What carries the argument

Unified task-dependent mask construction that combines attention-based edit cues and feature centroids drawn from source and target image streams to partition tokens into edit versus non-edit regions.
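A minimal sketch of the centroid step, under the assumption that "feature centroids" means the mean latent features of the provisionally edited and non-edited token sets, with each token then reassigned to the nearer centroid. The cosine-similarity rule below is an assumption; the paper does not fix the distance measure in this summary.

```python
import numpy as np

def centroid_partition(features: np.ndarray, coarse_cue: np.ndarray) -> np.ndarray:
    """Refine a coarse attention cue into an edit mask via feature centroids.

    features:   (num_tokens, dim) latent features from one image stream; which
                layer and timestep to read is the paper's choice, not fixed here.
    coarse_cue: (num_tokens,) boolean attention-derived cue for the same stream.
    Returns a (num_tokens,) boolean edit mask.
    """
    # Centroids of the provisional edit and non-edit token sets.
    edit_centroid = features[coarse_cue].mean(axis=0)
    keep_centroid = features[~coarse_cue].mean(axis=0)

    def cosine(x: np.ndarray, c: np.ndarray) -> np.ndarray:
        x = x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)
        c = c / (np.linalg.norm(c) + 1e-8)
        return x @ c

    # Reassign every token to the nearer centroid; feature-derived masks are
    # what the paper credits with cleaner boundaries than raw attention.
    return cosine(features, edit_centroid) > cosine(features, keep_centroid)
```

Upsampling the token-level mask back to pixel resolution is left out of the sketch.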

If this is right

  • Non-edit regions remain more consistent across a range of editing operations while instruction following stays strong.
  • The same localization logic can be added on top of existing diffusion transformer backbones without retraining them (see the latent-preservation sketch after this list).
  • Systematic analysis shows that the best choice of source versus target cues varies predictably with the type of edit being performed.
  • The approach works without any additional training or fine-tuning steps.
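The retraining-free claim rests on the mask acting only at inference time. Below is a minimal sketch of how mask-guided latent preservation could hook into an existing denoising loop, assuming the backbone exposes a per-step latent and a source-image latent at the same timestep; the simple convex blend is an assumption, not the paper's exact update rule.

```python
import torch

def preserve_non_edit_latents(pred_latent: torch.Tensor,
                              source_latent: torch.Tensor,
                              edit_mask: torch.Tensor) -> torch.Tensor:
    """One possible mask-guided latent preservation step (training-free).

    pred_latent:   (B, C, H, W) latent proposed by the editing backbone at the
                   current denoising timestep.
    source_latent: (B, C, H, W) latent carrying the source image at the same
                   timestep (how it is obtained depends on the backbone).
    edit_mask:     (B, 1, H, W) mask in {0, 1}, 1 inside the edit region.
    """
    # Keep the backbone's prediction inside the edit region and pin everything
    # else to the source, so non-edit content cannot drift during denoising.
    return edit_mask * pred_latent + (1.0 - edit_mask) * source_latent
```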

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same cue-to-centroid partitioning idea could be tested on video editing where temporal consistency is also at stake.
  • If the centroids prove stable, future models might learn to predict the task-specific mask directly instead of relying on post-hoc attention extraction.
  • The observation that localization needs differ by operation type suggests similar task-aware gating could help other conditional generation settings.

Load-bearing premise

Attention patterns in the source and target streams can be turned into feature centroids that cleanly separate edit regions from non-edit regions in a way that matches the specific editing task and does not create fresh artifacts.

What would settle it

Running the same instructions through the method on a held-out test set and finding either that instruction fidelity drops relative to the unmodified backbone or that unintended changes outside the intended edit region increase would disprove the central claim.
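Checking the second failure mode requires a region-restricted consistency measure. Here is a simplified masked-PSNR sketch for the non-edit region, assuming a ground-truth edit mask is available; it is a stand-in to make the measurement concrete, not EdiVal-Bench's actual protocol.

```python
import numpy as np

def non_edit_psnr(source: np.ndarray, edited: np.ndarray,
                  edit_mask: np.ndarray, data_range: float = 255.0) -> float:
    """PSNR computed only over pixels outside the edit region.

    source, edited: (H, W, 3) images in the same value range.
    edit_mask:      (H, W) boolean, True where editing is allowed.
    """
    keep = ~edit_mask  # the non-edit region
    diff = source[keep].astype(np.float64) - edited[keep].astype(np.float64)
    mse = float(np.mean(diff ** 2))
    if mse == 0.0:
        return float("inf")
    # Higher is better: the edited image should be unchanged where keep is True.
    return 10.0 * np.log10(data_range ** 2 / mse)
```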

Figures

Figures reproduced from arXiv: 2604.20258 by Chang Xu, Jingxuan He, Mengyu Zheng, Xiangyu Zeng, Xiyu Wang, Yunke Wang.

Figure 1
Figure 1: Motivation of our work. While the base image editing model [34] is capable of producing visually appealing results, it may introduce unexpected modifications such as human beautification (the first example) or slight viewpoint changes (the second example). Our edit localization framework is thus motivated to yield more faithful editing results. view at source ↗
Figure 2
Figure 2: Overview of our framework. Given an editing instruction, an initial noise, and a source image, the model performs joint attention over text, target, and source tokens within each DiT layer. We first decompose and propagate attention to derive attention maps that provide coarse estimation of instruction-relevant regions (Attention-Based Semantic Estimation). The attention maps are leveraged to compute clust… view at source ↗
Figure 3
Figure 3: Quantitative results of edit localization analysis. There are two main observations: (i) The red curves generally achieve higher IoU than the orange and blue ones across denoising timesteps and task types, indicating that latent features provide stronger semantic cues. (ii) Semantic signals emerge from different image streams depending on the specific type of the editing task. view at source ↗
Figure 4
Figure 4: Qualitative results of edit localization analysis. (i) Feature-derived masks exhibit clearer boundaries and more complete spatial coverage than attention-derived ones. (ii) Edit semantics emerge from different image streams based on the task type. view at source ↗
Figure 5
Figure 5: Qualitative comparisons with state-of-the-art instruction-based image editing methods. For each case, we display the edited image on the left and its corresponding pixel-wise difference map on the right. The results show that our method achieves superior content preservation while retaining the original editing capability of the base model. view at source ↗
Figure 6
Figure 6: Ablation on attention threshold and DiT layer. (top) The thresholds ranging from 0.3 to 0.7 yield similar IoU scores, indicating that our method is robust to the choice of attention threshold. (bottom) The segmentation performance improves as the layer depth increases, with IoU scores peaking at layer 50 (red curves). view at source ↗
Figure 7
Figure 7: Visualization of editing masks. For subject replacement, the source-stream mask identifies the subject to be removed, while the target-stream mask outlines the subject to be rendered. view at source ↗
Figure 8
Figure 8: Examples of fail cases. (top) The predicted edit mask can be inaccurate when the base model significantly modifies the layout of the image. (bottom) The predicted mask can be inaccurate when there are disjoint regions to be edited. view at source ↗
Figure 9
Figure 9: Qualitative comparison results of different methods. For each pair, the left figure shows the edited image, and the right figure displays the corresponding difference map relative to the input image. view at source ↗
Original abstract

Instruction-based image editing (IIE) aims to modify images according to textual instructions while preserving irrelevant content. Despite recent advances in diffusion transformers, existing methods often suffer from over-editing, introducing unintended changes to regions unrelated to the desired edit. We identify that this limitation arises from the lack of an explicit mechanism for edit localization. In particular, different editing operations (e.g., addition, removal and replacement) induce distinct spatial patterns, yet current IIE models typically treat localization in a task-agnostic manner. To address this limitation, we propose a training-free, task-aware edit localization framework that exploits the intrinsic source and target image streams within IIE models. For each image stream, we first obtain attention-based edit cues, and then construct feature centroids based on these attentive cues to partition tokens into edit and non-edit regions. Based on the observation that optimal localization is inherently task-dependent, we further introduce a unified mask construction strategy that selectively leverages source and target image streams for different editing tasks. We provide a systematic analysis for our proposed insights and approaches. Extensive experiments on EdiVal-Bench demonstrate our framework consistently improves non-edit region consistency while maintaining strong instruction-following performance on top of powerful recent image editing backbones, including Step1X-Edit and Qwen-Image-Edit.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a training-free, task-aware edit localization framework for instruction-based image editing (IIE) in diffusion transformer models. It extracts attention-based edit cues separately from source and target image streams, derives feature centroids to partition tokens into edit versus non-edit regions, and applies a unified mask construction strategy that selectively uses the two streams depending on the editing operation (addition, removal, replacement). The central claim is that this approach improves non-edit region consistency on EdiVal-Bench while preserving instruction-following performance when plugged into recent backbones such as Step1X-Edit and Qwen-Image-Edit.

Significance. If the reported gains hold under detailed scrutiny, the work provides a practical, zero-training enhancement to existing IIE pipelines by making localization explicitly task-dependent rather than treating it as a side effect of the generative process. The emphasis on intrinsic attention patterns and stream-specific masking could reduce over-editing artifacts in a manner that generalizes across model architectures, which is valuable given the rapid iteration of diffusion-based editors.

major comments (2)
  1. The abstract states that the framework 'consistently improves non-edit region consistency' on EdiVal-Bench, yet no numerical deltas, baseline comparisons, or ablation tables are referenced. Without these quantitative anchors (e.g., specific metrics for consistency and instruction adherence), the magnitude and robustness of the claimed benefit cannot be assessed from the provided description.
  2. The weakest assumption—that attention-derived feature centroids reliably separate edit and non-edit tokens in a task-dependent way without introducing new artifacts—is load-bearing for the entire pipeline. The manuscript would benefit from explicit failure-case analysis or counter-examples where the centroid partitioning misclassifies regions, particularly for complex instructions involving multiple objects.
minor comments (2)
  1. The description of the 'unified mask construction strategy' is high-level; a concrete algorithmic outline or pseudocode would clarify how source versus target streams are chosen for each editing type (a hedged sketch of one possible rule follows this list).
  2. EdiVal-Bench is referenced as the evaluation suite, but its construction, task distribution, and exact evaluation protocols for 'non-edit consistency' versus 'instruction following' are not summarized, making it difficult to interpret cross-backbone results.
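On the first minor point, here is a hedged sketch of what such a task-dependent rule might look like. The union rule for subject replacement is stated in the paper's Figure 7 and 8 discussion (the source-stream mask covers the object being removed, the target-stream mask the object being rendered); the addition and removal branches below are one plausible reading, not the authors' confirmed specification.

```python
import numpy as np

def unified_edit_mask(task: str,
                      source_mask: np.ndarray,
                      target_mask: np.ndarray) -> np.ndarray:
    """Choose which stream's mask (or their union) to use for a given task."""
    if task == "addition":
        # Assumption: the added object only appears in the target stream.
        return target_mask
    if task == "removal":
        # Assumption: the object to erase only appears in the source stream.
        return source_mask
    if task == "replacement":
        # Stated for subject replacement: the source mask covers what
        # disappears, the target mask what appears, and their union is used.
        return np.logical_or(source_mask, target_mask)
    raise ValueError(f"unknown editing task: {task!r}")
```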

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation of minor revision. The comments help strengthen the clarity of our claims and the robustness of our analysis. We respond to each major comment below and indicate the planned revisions.

Point-by-point responses
  1. Referee: The abstract states that the framework 'consistently improves non-edit region consistency' on EdiVal-Bench, yet no numerical deltas, baseline comparisons, or ablation tables are referenced. Without these quantitative anchors (e.g., specific metrics for consistency and instruction adherence), the magnitude and robustness of the claimed benefit cannot be assessed from the provided description.

    Authors: We appreciate the referee highlighting the need for quantitative specificity in the abstract. The abstract is written as a concise summary, while the detailed numerical results—including deltas on non-edit consistency metrics (e.g., LPIPS, PSNR, SSIM), instruction-following metrics (CLIP scores), baseline comparisons, and ablation studies—are provided in Section 4 and Tables 2–4. To address this point directly, we will revise the abstract to explicitly reference these key quantitative improvements and their consistency across backbones, thereby supplying the requested anchors while preserving the abstract's brevity. revision: yes

  2. Referee: The weakest assumption—that attention-derived feature centroids reliably separate edit and non-edit tokens in a task-dependent way without introducing new artifacts—is load-bearing for the entire pipeline. The manuscript would benefit from explicit failure-case analysis or counter-examples where the centroid partitioning misclassifies regions, particularly for complex instructions involving multiple objects.

    Authors: We agree that the reliability of the centroid partitioning is central and merits explicit scrutiny. Section 3 of the manuscript already presents a systematic analysis of attention cues and feature centroids, including visualizations and quantitative partitioning results that support task-dependent separation. To further strengthen the paper as suggested, we will add a dedicated failure-case analysis subsection (likely in Section 4 or the supplementary material) that examines potential misclassifications, with particular attention to complex multi-object instructions. This will include any observed counter-examples, discussion of introduced artifacts (if any), and how the unified mask construction strategy helps mitigate them. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper presents a training-free heuristic that extracts attention cues from pre-existing IIE backbones, computes feature centroids from those cues, and applies a task-dependent mask rule derived from observed spatial patterns in source/target streams. No equations reduce to fitted parameters, no self-referential definitions appear, and no load-bearing claims rest on self-citations or imported uniqueness theorems. Performance is shown via external benchmark metrics rather than internal construction, making the pipeline independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on the assumption that attention patterns in diffusion transformers encode edit-relevant spatial information and that feature centroids can separate edit from non-edit tokens without additional supervision.

axioms (2)
  • domain assumption Attention maps within IIE diffusion transformers provide usable edit cues for both source and target streams
    Central to obtaining the initial edit cues before centroid construction.
  • domain assumption Optimal localization strategy is inherently task-dependent across addition, removal, and replacement operations
    Justifies the selective use of source versus target streams in the unified mask.

pith-pipeline@v0.9.0 · 5544 in / 1352 out tokens · 24038 ms · 2026-05-10T00:37:24.519343+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

46 extracted references · 20 canonical work pages · 9 internal anchors

  1. [1]

    Samira Abnar and Willem Zuidema. 2020. Quantifying attention flow in transformers. In Proceedings of the 58th annual meeting of the association for computational linguistics. 4190–4197

  2. [2]

    Omri Avrahami, Dani Lischinski, and Ohad Fried. 2022. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 18208–18218

  3. [3]

    Tim Brooks, Aleksander Holynski, and Alexei A Efros. 2023. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 18392–18402

  4. [4]

    Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng. 2023. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In Proceedings of the IEEE/CVF international conference on computer vision. 22560–22570

  5. [5]

    Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. 2025. Sam 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719 (2025)

  6. [6]

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. 2021. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision. 9650–9660

  7. [7]

    Tianyu Chen, Yasi Zhang, Zhi Zhang, Peiyu Yu, Shu Wang, Zhendong Wang, Kevin Lin, Xiaofei Wang, Zhengyuan Yang, Linjie Li, et al. 2025. EdiVal-Agent: An Object-Centric Framework for Automated, Fine-Grained Evaluation of Multi-Turn Editing. arXiv preprint arXiv:2509.13399 (2025)

  8. [8]

    Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. 2022. Diffedit: Diffusion-based semantic image editing with mask guidance. arXiv preprint arXiv:2210.11427 (2022)

  9. [9]

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. 2024. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning

  10. [10–11]

    Tsu-Jui Fu, Wenze Hu, Xianzhi Du, William Yang Wang, Yinfei Yang, and Zhe Gan. 2023. Guiding instruction-based image editing via multimodal large language models. arXiv preprint arXiv:2309.17102 (2023)

  12. [12]

    Chaofan Gan, Yuanpeng Tu, Xi Chen, Tieyuan Chen, Yuxi Li, Mehrtash Harandi, and Weiyao Lin. 2025. Unleashing diffusion transformers for visual correspondence by modulating massive activations. arXiv preprint arXiv:2505.18584 (2025)

  13. [13]

    Zhen Han, Zeyinzi Jiang, Yulin Pan, Jingfeng Zhang, Chaojie Mao, Chenwei Xie, Yu Liu, and Jingren Zhou. 2024. Ace: All-round creator and editor following instructions via diffusion transformer. arXiv preprint arXiv:2410.00086 (2024)

  14. [14]

    Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. 2022. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626 (2022)

  15. [15]

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in neural information processing systems 33 (2020), 6840–6851

  16. [16]

    Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, et al. 2024. Smartedit: Exploring complex instruction-based image editing with multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8362–8371

  17. [17]

    Xuan Ju, Ailing Zeng, Yuxuan Bian, Shaoteng Liu, and Qiang Xu. 2023. Direct inversion: Boosting diffusion-based editing with 3 lines of code. arXiv preprint arXiv:2310.01506 (2023)

  18. [18]

    Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, and Wenhu Chen. 2024. Viescore: Towards explainable metrics for conditional image synthesis evaluation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 12268–12290

  19. [19–20]

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. 2022. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022)

  21. [21]

    Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. 2025. Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761 (2025)

  22. [22]

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. 2024. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European conference on computer vision. Springer, 38–55

  23. [23]

    Xingchao Liu, Chengyue Gong, and Qiang Liu. 2022. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022)

  24. [24]

    Chaojie Mao, Jingfeng Zhang, Yulin Pan, Zeyinzi Jiang, Zhen Han, Yu Liu, and Jingren Zhou. 2025. Ace++: Instruction-based image creation and editing via context-aware content filling. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1958–1966

  25. [25]

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. 2021. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741 (2021)

  26. [26]

    William Peebles and Saining Xie. 2023. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision. 4195–4205

  27. [27]

    Yuandong Pu, Le Zhuo, Kaiwen Zhu, Liangbin Xie, Wenlong Zhang, Xiangyu Chen, Peng Gao, Yu Qiao, Chao Dong, and Yihao Liu. 2025. Lumina-omnilv: A unified multimodal framework for general low-level vision. arXiv preprint arXiv:2504.04903 (2025)

  28. [28]

    Giorgio Roffo. 2026. Infinite Self-Attention. arXiv preprint arXiv:2603.00175 (2026)

  29. [29]

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10684–10695

  30. [30]

    Enis Simsar, Alessio Tonioni, Yongqin Xian, Thomas Hofmann, and Federico Tombari. 2025. Lime: Localized image editing via attention regularization in diffusion models. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). IEEE, 222–231

  31. [31–32]

    Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, and Xinchao Wang. Ominicontrol: Minimal and universal control for diffusion transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 14940–14950

  33. [33]

    Narek Tumanyan, Omer Bar-Tal, Shai Bagon, and Tali Dekel. 2022. Splicing vit features for semantic appearance transfer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10748–10757

  34. [34]

    Qian Wang, Biao Zhang, Michael Birsak, and Peter Wonka. 2023. Instructedit: Improving automatic masks for diffusion-based image editing with user instructions. arXiv preprint arXiv:2305.18047 (2023)

  35. [35]

    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. 2004. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13, 4 (2004), 600–612

  36. [36]

    Chenfei Wu, Lun Huang, Qianxi Zhang, Binyang Li, Lei Ji, Fan Yang, Guillermo Sapiro, and Nan Duan. 2021. Godiva: Generating open-domain videos from natural descriptions. arXiv preprint arXiv:2104.14806 (2021)

  37. [37]

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Shengming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. 2025. Qwen-image technical report. arXiv preprint arXiv:2508.02324 (2025)

  38. [38]

    Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, et al. 2025. Omnigen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871 (2025)

  39. [39]

    Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. 2025. Omnigen: Unified image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13294–13304

  40. [40]

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...

  41. [41]

    Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. 2023. Magicbrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Processing Systems 36 (2023), 31428–31449

  42. [42–43]

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition. 586–595

  44. [44]

    Xuanpu Zhang, Xuesong Niu, Ruidong Chen, Dan Song, Jianhao Zeng, Penghui Du, Haoxiang Cao, Kai Wu, and An-an Liu. 2025. Group Relative Attention Guidance for Image Editing. arXiv preprint arXiv:2510.24657 (2025)

  45. [45]

    Zechuan Zhang, Ji Xie, Yu Lu, Zongxin Yang, and Yi Yang. 2025. Enabling instructional image editing with in-context generation in large scale diffusion transformer. In The Thirty-ninth Annual Conference on Neural Information Processing Systems

  46. [46]

    Haozhe Zhao, Xiaojian Ma, Liang Chen, Shuzheng Si, Rujie Wu, Kaikai An, Peiyu Yu, Minjia Zhang, Qing Li, and Baobao Chang. 2024. Ultraedit: Instruction-based fine-grained image editing at scale. Advances in Neural Information Processing Systems 37 (2024), 3058–3093