pith. machine review for the scientific record.

arxiv: 2605.07457 · v1 · submitted 2026-05-08 · 💻 cs.CV


EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement

Bo Li, Guangtao Zhai, Huiyu Duan, Jian Zhang, Jinwei Chen, Mingda Du, Shusong Xu, Sijing Wu, Tianyi Zheng, Xiongkuo Min, Yifei Nie, Zitong Xu

Pith reviewed 2026-05-11 02:16 UTC · model grok-4.3

classification 💻 cs.CV
keywords text-guided image editing · image refinement · agentic framework · human feedback dataset · artifact detection · perceptual quality · localized editing · self-corrective editing

The pith

EditRefiner uses a four-agent perception-reasoning-action-evaluation loop and a new human-feedback dataset to refine text-guided image edits with better localization and perceptual alignment than prior methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Text-guided image editing models often leave behind unnatural objects, lighting mismatches, and unexpected changes. Existing fixes either regenerate entire images at high cost or rely on vision-language models that lack precise spatial awareness, leading to further semantic drift. The paper builds EditFHF-15K, a dataset of 15K edited images with 60K artifact regions, 80K failure regions, and 45K human opinion scores. It then introduces EditRefiner, a hierarchical agentic system that first detects salient artifact areas, reasons about their causes in human-like terms, executes targeted local corrections, and evaluates whether additional passes are needed. Experiments show the approach improves distortion localization, diagnostic accuracy, and human-aligned quality scores over current state-of-the-art refinement techniques.
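To make the dataset's unit of annotation concrete, here is an illustrative layout for a single EditFHF-15K sample; the field names and types are assumptions for exposition, not the released schema.

```python
# Illustrative record for one EditFHF-15K sample; field names and types
# are assumptions, not the released schema.
from dataclasses import dataclass, field

@dataclass
class EditFeedbackSample:
    source_image: str   # path to the pre-edit image
    edited_image: str   # path to the TIE model's output
    instruction: str    # the text editing prompt
    artifact_regions: list = field(default_factory=list)  # boxes with textual reasoning
    failure_regions: list = field(default_factory=list)   # boxes with textual reasoning
    mos: dict = field(default_factory=dict)  # {"perceptual_quality": ...,
                                             #  "instruction_following": ...,
                                             #  "visual_consistency": ...}
```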

Core claim

By reformulating post-editing correction as an explicit human-like perception-reasoning-action-evaluation loop and grounding it in the EditFHF-15K dataset of fine-grained human feedback, EditRefiner achieves more reliable detection of artifacts, more accurate diagnostic inference, and more precise localized re-editing without introducing new semantic drift.

What carries the argument

The four-agent loop: a perception agent outputs contextual saliency maps of artifacts and failures; a reasoning agent performs diagnostic inference from those maps; an action agent plans and executes localized re-editing; and an evaluation agent decides whether further refinement is required.
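The paper describes this loop only in prose; the following is a minimal control-flow sketch, with every interface (the agent call signatures, the Diagnosis record, the stopping rule, max_rounds) assumed for illustration rather than taken from the released code.

```python
# Hypothetical sketch of the perception-reasoning-action-evaluation loop.
# All interfaces here are illustrative assumptions, not the paper's API.
from dataclasses import dataclass

@dataclass
class Diagnosis:
    region: tuple  # (x, y, w, h) box around the flagged area
    cause: str     # human-readable explanation, e.g. "lighting mismatch"

def refine(image, instruction, agents, max_rounds=3):
    """Iteratively repair a text-guided edit until the evaluator accepts it."""
    for _ in range(max_rounds):
        # 1. Perception: contextual saliency maps over artifacts and failures.
        saliency = agents.perception(image, instruction)
        # 2. Reasoning: turn perceptual cues into localized diagnoses.
        diagnoses = agents.reasoning(image, saliency)
        if not diagnoses:
            break  # nothing left to fix
        # 3. Action: plan and execute one localized re-edit per diagnosis.
        for d in diagnoses:
            image = agents.action(image, d.region, d.cause, instruction)
        # 4. Evaluation: decide whether another pass is warranted.
        if agents.evaluation(image, instruction).acceptable:
            break
    return image
```

The evaluation agent is what separates this from single-pass VLM refinement: each localized action is re-checked before the system commits to another round.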

Load-bearing premise

The four specialized agents, when guided by cues from the EditFHF-15K dataset, can reliably diagnose and correct localized editing problems without creating new artifacts or changing the intended meaning of the edit.

What would settle it

A held-out test set of editing failures from models or tasks outside the 12 TIE models used to build EditFHF-15K, where EditRefiner either fails to improve mean opinion scores or introduces measurable semantic changes compared with the original edit.
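That criterion is mechanical enough to sketch. In the version below, embedding cosine similarity stands in as a proxy for semantic change; the embedding function, the paired MOS arrays, and the drift threshold are all stand-in assumptions, not metrics the paper commits to.

```python
# Sketch of the falsification test above, under stand-in assumptions:
# `embed` is any image-embedding function (e.g. a CLIP-style encoder) and
# `drift_tol` is an arbitrary threshold; neither comes from the paper.
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def claim_refuted(edits, refined, mos_before, mos_after, embed, drift_tol=0.05):
    """True if refinement fails to raise mean MOS or measurably drifts semantics."""
    no_gain = float(np.mean(mos_after)) <= float(np.mean(mos_before))
    drift = np.mean([1.0 - cosine(embed(e), embed(r))
                     for e, r in zip(edits, refined)])
    return no_gain or drift > drift_tol
```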

Figures

Figures reproduced from arXiv: 2605.07457 by Bo Li, Guangtao Zhai, Huiyu Duan, Jian Zhang, Jinwei Chen, Mingda Du, Shusong Xu, Sijing Wu, Tianyi Zheng, Xiongkuo Min, Yifei Nie, Zitong Xu.

Figure 1. Overview of our EditFHF-15K. (a) An illustration of our annotation interface, (b) the …
Figure 2. Overview of our EditRefiner. The framework operates as a perception-reasoning-action …
Figure 3. Example results from advanced TIE models and results with our EditRefiner. Additional …
Figure 4. Visualization of saliency map prediction. Our method produces sharper and more precise …
Original abstract

Recent text-guided image editing (TIE) models have made remarkable progress, yet edited images still frequently suffer from fine-grained issues such as unnatural objects, lighting mismatch, and unexpected changes. Existing refinement approaches either rely on costly iterative regeneration or employ vision-language models (VLMs) with weak spatial grounding, often resulting in semantic drift and unreliable local corrections. To address these limitations, we first construct EditFHF-15K, a dataset of fine-grained human feedback for edited images, comprising (1) 15K images from 12 TIE models spanning 43 editing tasks, (2) 60K annotated artifact regions and 80K editing failure regions, each accompanied by textual reasoning, and (3) 45K mean opinion scores (MOSs) assessing perceptual quality, instruction following, and visual consistency. Based on EditFHF-15K, we propose EditRefiner, a hierarchical, interpretable, and human-aligned agentic framework that reformulates post-editing correction as a human-like perception-reasoning-action-evaluation loop. Specifically, we introduce: (1) a perception agent that detects contextual saliency maps of artifacts and editing failures, (2) a reasoning agent that interprets these perceptual cues to perform human-aligned diagnostic inference, (3) an action agent that uses the reasoning output to plan and execute localized re-editing, and (4) an evaluation agent that assesses the re-edited image and guides the action agent on whether further refinements are required. Extensive experiments demonstrate that EditRefiner consistently outperforms state-of-the-art methods in distortion localization, diagnose accuracy and human perception alignment, establishing a new paradigm for self-corrective and perceptually reliable image editing. The code is available at https://github.com/IntMeGroup/EditRefiner.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper constructs EditFHF-15K, a dataset of 15K edited images from 12 TIE models across 43 tasks, with 60K annotated artifact regions, 80K failure regions, and 45K MOS scores. It proposes EditRefiner, a hierarchical agentic framework implementing a perception-reasoning-action-evaluation loop: a perception agent generates contextual saliency maps, a reasoning agent performs diagnostic inference, an action agent executes localized re-editing, and an evaluation agent assesses quality and decides on further iterations. The central claim is that this framework consistently outperforms prior refinement methods in distortion localization, diagnostic accuracy, and human perception alignment.

Significance. If the empirical results hold under rigorous validation, the work could introduce a practical self-corrective paradigm for text-guided image editing that mitigates semantic drift and weak spatial grounding in VLMs. The public code release and dataset construction from human feedback are strengths that support reproducibility and future benchmarking in the field.

major comments (2)
  1. [Experiments] The central claim of consistent outperformance in distortion localization and diagnostic accuracy rests on the four-agent loop reliably correcting artifacts without introducing new failures or semantic drift. This is load-bearing but under-supported if the action agent (which must translate reasoning into precise localized edits) inherits the spatial grounding weaknesses of the underlying VLMs, as the introduction itself acknowledges; the experiments section should include targeted failure-case analysis or quantitative checks for new artifact introduction post-refinement.
  2. [Dataset Construction and Experiments] EditFHF-15K is used both to drive the agents (via annotated regions and reasoning) and to measure success in localization/diagnosis accuracy. This creates a potential circularity risk where reported gains are partly dataset-specific rather than generalizable; the paper should clarify train/test splits, whether agents see held-out failure modes, and include cross-dataset or cross-model generalization results.
minor comments (2)
  1. [Abstract] The abstract asserts 'extensive experiments' and 'consistent outperformance' but supplies no key quantitative metrics, baseline names, or effect sizes; adding 1-2 headline numbers (e.g., localization IoU or accuracy deltas) would improve readability.
  2. [Method] Notation for the agent loop (perception saliency maps feeding into reasoning) could be clarified with a diagram or pseudocode in §3 to make the hierarchical flow explicit.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We have carefully addressed each major comment below and revised the paper to incorporate the suggested analyses and clarifications.

Point-by-point responses
  1. Referee: [Experiments] The central claim of consistent outperformance in distortion localization and diagnostic accuracy rests on the four-agent loop reliably correcting artifacts without introducing new failures or semantic drift. This is load-bearing but under-supported if the action agent (which must translate reasoning into precise localized edits) inherits the spatial grounding weaknesses of the underlying VLMs, as the introduction itself acknowledges; the experiments section should include targeted failure-case analysis or quantitative checks for new artifact introduction post-refinement.

    Authors: We agree that explicit verification of whether the action agent introduces new artifacts is necessary to fully support the central claim, given the acknowledged spatial grounding limitations of VLMs. While the original experiments reported net gains in localization accuracy and human perception metrics, they did not include dedicated checks for post-refinement artifact introduction. In the revised manuscript, we have added a new subsection in the Experiments section with targeted failure-case analysis. This includes qualitative examples of introduced artifacts and quantitative metrics comparing the number of annotated artifact regions before and after refinement on the held-out test data. These results indicate that new artifacts are introduced in a small minority of cases and do not offset the overall improvements (a minimal version of this check is sketched after these responses). revision: yes

  2. Referee: [Dataset Construction and Experiments] EditFHF-15K is used both to drive the agents (via annotated regions and reasoning) and to measure success in localization/diagnosis accuracy. This creates a potential circularity risk where reported gains are partly dataset-specific rather than generalizable; the paper should clarify train/test splits, whether agents see held-out failure modes, and include cross-dataset or cross-model generalization results.

    Authors: We thank the referee for identifying this potential circularity concern. The EditFHF-15K dataset was constructed with a predefined 80/20 train/test split before any agent development. All agents (perception, reasoning, action, and evaluation) were trained and tuned exclusively on the training split using the annotated regions and reasoning, while all quantitative results for localization accuracy, diagnostic accuracy, and human alignment were computed solely on the held-out test split. No test data was used during agent training or hyperparameter selection. To further address generalizability, we have added cross-model experiments evaluating the framework on three additional TIE models not present in the original 12, as well as cross-dataset results on an external public image editing benchmark. These new results are included in the revised Experiments section (a deterministic split is likewise sketched below). revision: yes
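A minimal version of the before/after artifact check described in response 1, with `count_artifact_regions` standing in for whatever annotation or detection step produced the dataset's region labels.

```python
# Sketch of the post-refinement artifact check from response 1;
# `count_artifact_regions` is a stand-in for the annotation pipeline.
def artifact_regression_rate(pairs, count_artifact_regions):
    """Fraction of refined images with more artifact regions than before."""
    worse = sum(count_artifact_regions(after) > count_artifact_regions(before)
                for before, after in pairs)
    return worse / len(pairs)
```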
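And one common way to make the 80/20 split from response 2 auditable and leakage-free is to derive each sample's assignment deterministically from its ID before any agent development; hashing is an assumed convention here, not necessarily what the authors did.

```python
# Deterministic 80/20 split keyed on sample IDs; hashing is one common
# choice for a leakage-free assignment, assumed here for illustration.
import hashlib

def split(sample_id: str, test_fraction: float = 0.2) -> str:
    """Assign a sample to 'train' or 'test' from its ID alone."""
    h = int(hashlib.sha256(sample_id.encode("utf-8")).hexdigest(), 16)
    return "test" if (h % 10_000) < int(test_fraction * 10_000) else "train"
```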

Circularity Check

0 steps flagged

No significant circularity; empirical framework is self-contained

full rationale

The paper constructs EditFHF-15K as a new dataset of human feedback and then defines EditRefiner as a four-agent hierarchical loop (perception saliency maps, diagnostic reasoning, localized re-editing, evaluation-guided iteration). The central claims are empirical outperformance on distortion localization, diagnostic accuracy, and human perception alignment versus prior methods. No equations, fitted parameters, or predictions are presented that reduce by construction to the inputs. The dataset is used for both agent development and evaluation, which is standard ML practice and does not meet the criteria for circularity (no quoted self-definitional reduction or load-bearing self-citation of a uniqueness theorem). The framework is presented as a design choice rather than a derivation that collapses to its own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The framework depends on the assumption that human annotations in EditFHF-15K faithfully capture perceptual failures and that VLMs can be orchestrated into a reliable closed-loop correction process; no free parameters or new physical entities are described.

axioms (2)
  • domain assumption: Human mean opinion scores and region annotations in the constructed dataset accurately reflect perceptual quality, instruction following, and visual consistency.
    The entire training and evaluation of the agents rests on these annotations being reliable ground truth.
  • domain assumption: Vision-language models possess sufficient spatial grounding when guided by the proposed perception and reasoning agents.
    The paper contrasts its approach with prior VLMs that have weak grounding, implying the new agent structure overcomes this limitation.
invented entities (1)
  • Perception-reasoning-action-evaluation agent loop (no independent evidence)
    purpose: to implement human-aligned iterative refinement of image edits
    Newly proposed components whose effectiveness is asserted via the new dataset rather than independent external validation.

pith-pipeline@v0.9.0 · 5669 in / 1508 out tokens · 66600 ms · 2026-05-11T02:16:04.822444+00:00 · methodology



Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 10 internal anchors

  1. [1] Banerjee, S., Lavie, A.: METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization (ACL Workshop). pp. 65–72 (Jun 2005)
  2. [2] Brooks, T., Holynski, A., Efros, A.A.: InstructPix2Pix: Learning to follow image editing instructions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 18392–18402 (2023)
  3. [3] Bruce, N., Tsotsos, J.: Saliency based on information maximization. In: Weiss, Y., Schölkopf, B., Platt, J. (eds.) NIPS. vol. 18. MIT Press (2005), https://proceedings.neurips.cc/paper_files/paper/2005/file/0738069b244a1c43c83112b735140a16-Paper.pdf
  4. [4] Bylinskii, Z., Judd, T., Oliva, A., Torralba, A., Durand, F.: What do different evaluation metrics tell us about saliency models? IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 41(3), 740–757 (2019)
  5. [5] Cao, M., Wang, X., Qi, Z., Shan, Y., Qie, X., Zheng, Y.: MasaCtrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 22560–22570 (2023)
  6. [6] Cerf, M., Harel, J., Einhaeuser, W., Koch, C.: Predicting human gaze using low-level saliency combined with face detection. In: Platt, J., Koller, D., Singer, Y., Roweis, S. (eds.) NIPS. vol. 20. Curran Associates, Inc. (2007), https://proceedings.neurips.cc/paper_files/paper/2007/file/708f3cf8100d5e71834b1db77dfa15d6-Paper.pdf
  7. [7] Chan, C.M., Chen, W., Su, Y., Yu, J., Xue, W., Zhang, S., et al.: ChatEval: Towards better LLM-based evaluators through multi-agent debate. In: Proceedings of the International Conference on Learning Representations (ICLR). pp. 1–9 (2023)
  8. [8] Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)
  9. [9] Cornia, M., Baraldi, L., Serra, G., Cucchiara, R.: A deep multi-level network for saliency prediction. In: ICPR (2016)
  10. [10] Cornia, M., Baraldi, L., Serra, G., Cucchiara, R.: Predicting human eye fixations via an LSTM-based saliency attentive model. IEEE Transactions on Image Processing 27(10), 5142–5154 (2018)
  11. [11] Duan, H., Hu, Q., Wang, J., Yang, L., Xu, Z., Liu, L., Min, X., Cai, C., Ye, T., Zhang, X., Zhai, G.: FineVQ: Fine-grained user generated content video quality assessment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)
  12. [12] Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Proceedings of the International Conference on Machine Learning (ICML) (2024)
  13. [13] Gao, T., Yao, X., Chen, D.: SimCSE: Simple contrastive learning of sentence embeddings. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) (2021)
  14. [14] Goferman, S., Zelnik-Manor, L., Tal, A.: Context-aware saliency detection. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. pp. 2376–2383 (2010). https://doi.org/10.1109/CVPR.2010.5539929
  15. [15] Google DeepMind: Gemini 3.1 Pro: Best for complex tasks and bringing creative concepts to life. https://deepmind.google/models/gemini/pro/ (2025)
  16. [16] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 770–778 (2016)
  17. [17] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., et al.: LoRA: Low-rank adaptation of large language models. In: Proceedings of the International Conference on Learning Representations (ICLR) (2022)
  18. [18] Huang, X., Liu, W., Chen, X., Wang, X., Wang, H., Lian, D., Wang, Y., Tang, R., Chen, E.: Understanding the planning of LLM agents: A survey. arXiv preprint arXiv:2402.02716 (2024)
  19. [19] Huang, X., Shen, C., Boix, X., Zhao, Q.: SALICON: Reducing the semantic gap in saliency prediction by adapting deep neural networks. In: ICCV (December 2015)
  20. [20] Hui, M., Yang, S., Zhao, B., Shi, Y., Wang, H., Wang, P., et al.: HQ-Edit: A high-quality dataset for instruction-based image editing. arXiv preprint arXiv:2404.09990 (2024)
  21. [21] International Telecommunication Union (ITU): Methodology for the subjective assessment of the quality of television pictures. Tech. Rep. Rec. ITU-R BT.500-13 (Jan 2012)
  22. [22] Ju, X., Zeng, A., Bian, Y., Liu, S., Xu, Q.: PnP Inversion: Boosting diffusion-based editing with 3 lines of code. In: Proceedings of the International Conference on Learning Representations (ICLR) (2024)
  23. [23] Kulikov, V., Kleiner, M., Huberman-Spiegelglas, I., Michaeli, T.: FlowEdit: Inversion-free text-based editing using pre-trained flow models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 19721–19730 (2025)
  24. [24] Lao, S., Gong, Y., Shi, S., Yang, S., Wu, T., Wang, J., et al.: Attentions help CNNs see better: Attention-based hybrid image quality assessment network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. pp. 1140–1149 (2022)
  25. [25] Li, H., Zhang, M., Zheng, D., Guo, Z., Jia, Y., Feng, K., et al.: EditThinker: Unlocking iterative reasoning for any image editor. arXiv preprint arXiv:2512.05965 (2025)
  26. [26] Liang, T., He, Z., Jiao, W., Wang, X., Wang, Y., Wang, R., et al.: Encouraging divergent thinking in large language models through multi-agent debate. arXiv preprint arXiv:2305.19118 (2023)
  27. [27] Liang, Y., He, J., Li, G., Li, P., Klimovskiy, A., Carolan, N., Sun, J., Pont-Tuset, J., Young, S., Yang, F., et al.: Rich human feedback for text-to-image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19401–19411 (2024)
  28. [28] Lin, C.Y.: ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out. pp. 74–81. Association for Computational Linguistics, Barcelona, Spain (Jul 2004), https://aclanthology.org/W04-1013/
  29. [29] Liu, L., Cai, C., Shen, S., Liang, J., Ouyang, W., Ye, T., Mao, J., Duan, H., Yao, J., Zhang, X., et al.: MoA-VR: A mixture-of-agents system towards all-in-one video restoration. IEEE Journal of Selected Topics in Signal Processing (2025)
  30. [30] Liu, S., Han, Y., Xing, P., Yin, F., Wang, R., Cheng, W., et al.: Step1X-Edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761 (2025)
  31. [31] Lou, J., Ma, L., Hu, K.X., Yang, H., Lin, W.Y.: TranSalNet: Towards perceptually relevant visual saliency prediction. Neurocomputing 507, 250–264 (2022)
  32. [32] Lu, S., Li, Y., Xia, Y., Hu, Y., Zhao, S., Ma, Y., et al.: Ovis2.5 technical report. arXiv preprint arXiv:2508.11737 (2025)
  33. [33] Luo, X., Wang, J., Wu, C., Xiao, S., Jiang, X., Lian, D., et al.: EditScore: Unlocking online RL for image editing via high-fidelity reward modeling. arXiv preprint arXiv:2509.23909 (2025)
  34. [34] Lupascu, M., Stupariu, M.S.: Optimal transport for rectified flow image editing: Unifying inversion-based and direct methods. arXiv preprint arXiv:2508.02363 (2025)
  35. [35] Mao, C., Zhang, J., Pan, Y., Jiang, Z., Han, Z., Liu, Y., Zhou, J.: ACE++: Instruction-based image creation and editing via context-aware content filling. arXiv preprint arXiv:2501.02487 (2025)
  36. [36] Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
  37. [37] Nam, H., Kwon, G., Park, G.Y., Ye, J.C.: Contrastive denoising score for text-guided latent diffusion image editing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 9192–9201 (June 2024)
  38. [38] OpenAI: ChatGPT 5. https://openai.com/gpt-5/ (2025)
  39. [39] OpenAI: GPT-5. https://www.openai.com (2025)
  40. [40] Qian, Y., Bocek-Rivele, E., Song, L., Tong, J., Yang, Y., Lu, J., et al.: Pico-Banana-400K: A large-scale dataset for text-guided image editing. arXiv preprint arXiv:2510.19808 (2025)
  41. [41] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10684–10695 (2022)
  42. [42] Seedream, T., Chen, Y., Gao, Y., Gong, L., Guo, M., Guo, Q., et al.: Seedream 4.0: Toward next-generation multimodal image generation. arXiv preprint (2025)
  43. [43] Sheikh, H.R., Bovik, A.C.: Image information and visual quality. IEEE Transactions on Image Processing (TIP) 15(2), 430–444 (2006)
  44. [44] Shen, S., Liang, J., Cai, C., Geng, C., Duan, H., Zhang, X., et al.: Agentic retoucher for text-to-image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2026)
  45. [45] Wang, J., Duan, H., Zhai, G., Min, X.: Quality assessment for AI generated images with instruction tuning. IEEE Transactions on Multimedia (TMM) (2026)
  46. [46] Wang, J., Duan, H., Zhao, Y., Wang, J., Zhai, G., Min, X.: LMM4LMM: Benchmarking and evaluating large-multimodal image generation with LMMs. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 17312–17323 (2025)
  47. [47] Wang, J., Wang, J., Duan, H., Kang, J., Zhai, G., Min, X.: I2I-Bench: A comprehensive benchmark suite for image-to-image editing models. arXiv preprint arXiv:2512.04660 (2025)
  48. [48] Wang, L., Xing, X., Cheng, Y., Zhao, Z., Donghao, L., Tiankai, H., et al.: PromptEnhancer: A simple approach to enhance text-to-image models via chain-of-thought prompt rewriting. arXiv preprint arXiv:2509.04545 (2025)
  49. [49] Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., et al.: InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025)
  50. [50] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing (TIP) 13(4), 600–612 (2004)
  51. [51] Wei, H., Liu, H., Wang, Z., Peng, Y., Xu, B., Wu, S., et al.: Skywork UniPic 3.0: Unified multi-image composition via sequence modeling. arXiv preprint arXiv:2601.15664 (2026)
  52. [52] Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., et al.: Qwen-Image technical report. arXiv preprint arXiv:2508.02324 (2025)
  53. [53] Wu, C., Zheng, P., Yan, R., Xiao, S., Luo, X., Wang, Y., et al.: OmniGen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871 (2025)
  54. [54] Wu, H., Zhang, Z., Zhang, W., Chen, C., Li, C., Liao, L., et al.: Q-Align: Teaching LMMs for visual scoring via discrete text-defined levels. arXiv preprint arXiv:2312.17090 (2023)
  55. [55] Wu, K., Jiang, S., Ku, M., Nie, P., Liu, M., Chen, W.: EditReward: A human-aligned reward model for instruction-guided image editing. arXiv preprint arXiv:2509.26346 (2025)
  56. [56] Wu, Q., Bansal, G., Zhang, J., Wu, Y., Li, B., Zhu, E., et al.: AutoGen: Enabling next-gen LLM applications via multi-agent conversation. arXiv preprint arXiv:2308.08155 (2023)
  57. [58] Wu, Y., Li, Z., Hu, X., Ye, X., Zeng, X., Yu, G., et al.: KRIS-Bench: Benchmarking next-level intelligent image editing models. arXiv preprint arXiv:2505.16707 (2025)
  58. [59] Xia, B., Peng, B., Zhang, Y., Huang, J., Liu, J., Li, J., et al.: DreamOmni2: Multimodal instruction-based editing and generation. arXiv preprint arXiv:2510.06679 (2025)
  59. [60] Xu, Z., Duan, H., Ji, Z., Zhang, X., Liu, Y., Min, X., et al.: EditHF-1M: A million-scale rich human preference feedback for image editing. arXiv preprint arXiv:2603.14916 (2026)
  60. [61] Xu, Z., Duan, H., Liu, B., Ma, G., Wang, J., Yang, L., et al.: LMM4Edit: Benchmarking and evaluating multimodal image editing with LMMs. In: Proceedings of the ACM International Conference on Multimedia (ACM MM). pp. 6908–6917 (October 2025)
  61. [62] Xu, Z., Duan, H., Ma, G., Yang, L., Wang, J., Wu, Q., et al.: HarmonyIQA: Pioneering benchmark and model for image harmonization quality assessment. In: IEEE International Conference on Multimedia and Expo (ICME). pp. 1–6 (2025)
  62. [63] Xu, Z., Shen, D., Du, Y., Hao, K., Huang, J., Huang, X.: MagicWand: A universal agent for generation and evaluation aligned with user preference (2025)
  63. [64] Xue, W., Zhang, L., Mou, X., Bovik, A.C.: Gradient magnitude similarity deviation: A highly efficient perceptual image quality index. IEEE Transactions on Image Processing (TIP) 23(2), 684–695 (2013)
  64. [65] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)
  65. [66] Yang, L., Duan, H., Wang, J., Liu, J., Hu, M., Min, X., Zhai, G., Le Callet, P.: Quality assessment and distortion-aware saliency prediction for AI-generated omnidirectional images. IEEE Transactions on Circuits and Systems for Video Technology (2025)
  66. [67] Ye, R., Zhang, J., Liu, Z., Zhu, Z., Yang, S., Li, L., et al.: Agent Banana: High-fidelity image editing with agentic thinking and tooling. arXiv preprint arXiv:2602.09084 (2026)
  67. [68] Yin, G., Wang, W., Yuan, Z., Han, C., Ji, W., Sun, S., et al.: Content-variant reference image quality assessment via knowledge distillation. In: Proceedings of the Conference on Association for the Advancement of Artificial Intelligence (AAAI). vol. 36, pp. 3134–3142 (2022)
  68. [69] Zhang, K., Mo, L., Chen, W., Sun, H., Su, Y.: MagicBrush: A manually annotated dataset for instruction-guided image editing. In: Proceedings of the Advances in Neural Information Processing Systems (NeurIPS) (2023)
  69. [70] Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
  70. [71] Zhao, Q., Cai, J.: Visual saliency detection by spatially weighted dissimilarity. In: 2011 IEEE CVPR. pp. 1241–1248. IEEE (2011)
  71. [72] Zhu, K., Gu, J., You, Z., Qiao, Y., Dong, C.: An intelligent agentic system for complex image restoration problems. arXiv preprint arXiv:2410.17809 (2024)