pith. machine review for the scientific record.

arxiv: 2605.07457 · v1 · submitted 2026-05-08 · 💻 cs.CV


EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement

Bo Li, Guangtao Zhai, Huiyu Duan, Jian Zhang, Jinwei Chen, Mingda Du, Shusong Xu, Sijing Wu, Tianyi Zheng, Xiongkuo Min, Yifei Nie, Zitong Xu

Pith reviewed 2026-05-11 02:16 UTC · model grok-4.3

classification 💻 cs.CV
keywords text-guided image editing · image refinement · agentic framework · human feedback dataset · artifact detection · perceptual quality · localized editing · self-corrective editing

The pith

EditRefiner uses a four-agent perception-reasoning-action-evaluation loop and a new human-feedback dataset to refine text-guided image edits with better localization and perceptual alignment than prior methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Text-guided image editing models often leave behind unnatural objects, lighting mismatches, and unexpected changes. Existing fixes either regenerate entire images at high cost or rely on vision-language models that lack precise spatial awareness, leading to further semantic drift. The paper builds EditFHF-15K, a dataset of 15K edited images with 60K artifact regions, 80K failure regions, and 45K human opinion scores. It then introduces EditRefiner, a hierarchical agentic system that first detects salient artifact areas, reasons about their causes in human-like terms, executes targeted local corrections, and evaluates whether additional passes are needed. Experiments show the approach improves distortion localization, diagnostic accuracy, and human-aligned quality scores over current state-of-the-art refinement techniques.
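To make the dataset's unit of annotation concrete, here is an illustrative layout for a single EditFHF-15K sample; the field names and types are assumptions for exposition, not the released schema.

```python
# Illustrative record for one EditFHF-15K sample; field names and types
# are assumptions, not the released schema.
from dataclasses import dataclass, field

@dataclass
class EditFeedbackSample:
    source_image: str   # path to the pre-edit image
    edited_image: str   # path to the TIE model's output
    instruction: str    # the text editing prompt
    artifact_regions: list = field(default_factory=list)  # boxes with textual reasoning
    failure_regions: list = field(default_factory=list)   # boxes with textual reasoning
    mos: dict = field(default_factory=dict)  # {"perceptual_quality": ...,
                                             #  "instruction_following": ...,
                                             #  "visual_consistency": ...}
```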

Core claim

By reformulating post-editing correction as an explicit human-like perception-reasoning-action-evaluation loop and grounding it in the EditFHF-15K dataset of fine-grained human feedback, EditRefiner achieves more reliable detection of artifacts, more accurate diagnostic inference, and more precise localized re-editing without introducing new semantic drift.

What carries the argument

The four-agent loop: a perception agent outputs contextual saliency maps of artifacts and failures; a reasoning agent performs diagnostic inference from those maps; an action agent plans and executes localized re-editing; and an evaluation agent decides whether further refinement is required.
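The paper describes this loop only in prose; the following is a minimal control-flow sketch, with every interface (the agent call signatures, the Diagnosis record, the stopping rule, max_rounds) assumed for illustration rather than taken from the released code.

```python
# Hypothetical sketch of the perception-reasoning-action-evaluation loop.
# All interfaces here are illustrative assumptions, not the paper's API.
from dataclasses import dataclass

@dataclass
class Diagnosis:
    region: tuple  # (x, y, w, h) box around the flagged area
    cause: str     # human-readable explanation, e.g. "lighting mismatch"

def refine(image, instruction, agents, max_rounds=3):
    """Iteratively repair a text-guided edit until the evaluator accepts it."""
    for _ in range(max_rounds):
        # 1. Perception: contextual saliency maps over artifacts and failures.
        saliency = agents.perception(image, instruction)
        # 2. Reasoning: turn perceptual cues into localized diagnoses.
        diagnoses = agents.reasoning(image, saliency)
        if not diagnoses:
            break  # nothing left to fix
        # 3. Action: plan and execute one localized re-edit per diagnosis.
        for d in diagnoses:
            image = agents.action(image, d.region, d.cause, instruction)
        # 4. Evaluation: decide whether another pass is warranted.
        if agents.evaluation(image, instruction).acceptable:
            break
    return image
```

The evaluation agent is what separates this from single-pass VLM refinement: each localized action is re-checked before the system commits to another round.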

Load-bearing premise

The four specialized agents, when guided by cues from the EditFHF-15K dataset, can reliably diagnose and correct localized editing problems without creating new artifacts or changing the intended meaning of the edit.

What would settle it

A held-out test set of editing failures from models or tasks outside the 12 TIE models used to build EditFHF-15K, where EditRefiner either fails to improve mean opinion scores or introduces measurable semantic changes compared with the original edit.
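That criterion is mechanical enough to sketch. In the version below, embedding cosine similarity stands in as a proxy for semantic change; the embedding function, the paired MOS arrays, and the drift threshold are all stand-in assumptions, not metrics the paper commits to.

```python
# Sketch of the falsification test above, under stand-in assumptions:
# `embed` is any image-embedding function (e.g. a CLIP-style encoder) and
# `drift_tol` is an arbitrary threshold; neither comes from the paper.
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def claim_refuted(edits, refined, mos_before, mos_after, embed, drift_tol=0.05):
    """True if refinement fails to raise mean MOS or measurably drifts semantics."""
    no_gain = float(np.mean(mos_after)) <= float(np.mean(mos_before))
    drift = np.mean([1.0 - cosine(embed(e), embed(r))
                     for e, r in zip(edits, refined)])
    return no_gain or drift > drift_tol
```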

Figures

Figures reproduced from arXiv: 2605.07457 by Bo Li, Guangtao Zhai, Huiyu Duan, Jian Zhang, Jinwei Chen, Mingda Du, Shusong Xu, Sijing Wu, Tianyi Zheng, Xiongkuo Min, Yifei Nie, Zitong Xu.

Figure 1. Overview of our EditFHF-15K. (a) An illustration of our annotation interface, (b) the …
Figure 2. Overview of our EditRefiner. The framework operates as a perception-reasoning-action …
Figure 3. Example results from advanced TIE models and results with our EditRefiner. Additional …
Figure 4. Visualization of saliency map prediction. Our method produces sharper and more precise …
Original abstract

Recent text-guided image editing (TIE) models have made remarkable progress, yet edited images still frequently suffer from fine-grained issues such as unnatural objects, lighting mismatch, and unexpected changes. Existing refinement approaches either rely on costly iterative regeneration or employ vision-language models (VLMs) with weak spatial grounding, often resulting in semantic drift and unreliable local corrections. To address these limitations, we first construct EditFHF-15K, a dataset of fine-grained human feedback for edited images, comprising (1) 15K images from 12 TIE models spanning 43 editing tasks, (2) 60K annotated artifact regions and 80K editing failure regions, each accompanied by textual reasoning, and (3) 45K mean opinion scores (MOSs) assessing perceptual quality, instruction following, and visual consistency. Based on EditFHF-15K, we propose EditRefiner, a hierarchical, interpretable, and human-aligned agentic framework that reformulates post-editing correction as a human-like perception-reasoning-action-evaluation loop. Specifically, we introduce: (1) a perception agent that detects contextual saliency maps of artifacts and editing failures, (2) a reasoning agent that interprets these perceptual cues to perform human-aligned diagnostic inference, (3) an action agent that uses the reasoning output to plan and execute localized re-editing, and (4) an evaluation agent that assesses the re-edited image and guides the action agent on whether further refinements are required. Extensive experiments demonstrate that EditRefiner consistently outperforms state-of-the-art methods in distortion localization, diagnose accuracy and human perception alignment, establishing a new paradigm for self-corrective and perceptually reliable image editing. The code is available at https://github.com/IntMeGroup/EditRefiner.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper constructs EditFHF-15K, a dataset of 15K edited images from 12 TIE models across 43 tasks, with 60K annotated artifact regions, 80K failure regions, and 45K MOS scores. It proposes EditRefiner, a hierarchical agentic framework implementing a perception-reasoning-action-evaluation loop: a perception agent generates contextual saliency maps, a reasoning agent performs diagnostic inference, an action agent executes localized re-editing, and an evaluation agent assesses quality and decides on further iterations. The central claim is that this framework consistently outperforms prior refinement methods in distortion localization, diagnostic accuracy, and human perception alignment.

Significance. If the empirical results hold under rigorous validation, the work could introduce a practical self-corrective paradigm for text-guided image editing that mitigates semantic drift and weak spatial grounding in VLMs. The public code release and dataset construction from human feedback are strengths that support reproducibility and future benchmarking in the field.

major comments (2)
  1. [Experiments] The central claim of consistent outperformance in distortion localization and diagnostic accuracy rests on the four-agent loop reliably correcting artifacts without introducing new failures or semantic drift. This is load-bearing but under-supported if the action agent (which must translate reasoning into precise localized edits) inherits the spatial grounding weaknesses of the underlying VLMs, as the introduction itself acknowledges; the experiments section should include targeted failure-case analysis or quantitative checks for new artifact introduction post-refinement.
  2. [Dataset Construction and Experiments] EditFHF-15K is used both to drive the agents (via annotated regions and reasoning) and to measure success in localization/diagnosis accuracy. This creates a potential circularity risk where reported gains are partly dataset-specific rather than generalizable; the paper should clarify train/test splits, whether agents see held-out failure modes, and include cross-dataset or cross-model generalization results.
minor comments (2)
  1. [Abstract] The abstract asserts 'extensive experiments' and 'consistent outperformance' but supplies no key quantitative metrics, baseline names, or effect sizes; adding 1-2 headline numbers (e.g., localization IoU or accuracy deltas) would improve readability.
  2. [Method] Notation for the agent loop (perception saliency maps feeding into reasoning) could be clarified with a diagram or pseudocode in §3 to make the hierarchical flow explicit.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We have carefully addressed each major comment below and revised the paper to incorporate the suggested analyses and clarifications.

Point-by-point responses
  1. Referee: [Experiments] The central claim of consistent outperformance in distortion localization and diagnostic accuracy rests on the four-agent loop reliably correcting artifacts without introducing new failures or semantic drift. This is load-bearing but under-supported if the action agent (which must translate reasoning into precise localized edits) inherits the spatial grounding weaknesses of the underlying VLMs, as the introduction itself acknowledges; the experiments section should include targeted failure-case analysis or quantitative checks for new artifact introduction post-refinement.

    Authors: We agree that explicit verification of whether the action agent introduces new artifacts is necessary to fully support the central claim, given the acknowledged spatial grounding limitations of VLMs. While the original experiments reported net gains in localization accuracy and human perception metrics, they did not include dedicated checks for post-refinement artifact introduction. In the revised manuscript, we have added a new subsection in the Experiments section with targeted failure-case analysis. This includes qualitative examples of introduced artifacts and quantitative metrics comparing the number of annotated artifact regions before and after refinement on the held-out test data. These results indicate that new artifacts are introduced in a small minority of cases and do not offset the overall improvements (a minimal version of this check is sketched after these responses). revision: yes

  2. Referee: [Dataset Construction and Experiments] EditFHF-15K is used both to drive the agents (via annotated regions and reasoning) and to measure success in localization/diagnosis accuracy. This creates a potential circularity risk where reported gains are partly dataset-specific rather than generalizable; the paper should clarify train/test splits, whether agents see held-out failure modes, and include cross-dataset or cross-model generalization results.

    Authors: We thank the referee for identifying this potential circularity concern. The EditFHF-15K dataset was constructed with a predefined 80/20 train/test split before any agent development. All agents (perception, reasoning, action, and evaluation) were trained and tuned exclusively on the training split using the annotated regions and reasoning, while all quantitative results for localization accuracy, diagnostic accuracy, and human alignment were computed solely on the held-out test split. No test data was used during agent training or hyperparameter selection. To further address generalizability, we have added cross-model experiments evaluating the framework on three additional TIE models not present in the original 12, as well as cross-dataset results on an external public image editing benchmark. These new results are included in the revised Experiments section (a deterministic split is likewise sketched below). revision: yes
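A minimal version of the before/after artifact check described in response 1, with `count_artifact_regions` standing in for whatever annotation or detection step produced the dataset's region labels.

```python
# Sketch of the post-refinement artifact check from response 1;
# `count_artifact_regions` is a stand-in for the annotation pipeline.
def artifact_regression_rate(pairs, count_artifact_regions):
    """Fraction of refined images with more artifact regions than before."""
    worse = sum(count_artifact_regions(after) > count_artifact_regions(before)
                for before, after in pairs)
    return worse / len(pairs)
```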
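And one common way to make the 80/20 split from response 2 auditable and leakage-free is to derive each sample's assignment deterministically from its ID before any agent development; hashing is an assumed convention here, not necessarily what the authors did.

```python
# Deterministic 80/20 split keyed on sample IDs; hashing is one common
# choice for a leakage-free assignment, assumed here for illustration.
import hashlib

def split(sample_id: str, test_fraction: float = 0.2) -> str:
    """Assign a sample to 'train' or 'test' from its ID alone."""
    h = int(hashlib.sha256(sample_id.encode("utf-8")).hexdigest(), 16)
    return "test" if (h % 10_000) < int(test_fraction * 10_000) else "train"
```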

Circularity Check

0 steps flagged

No significant circularity; empirical framework is self-contained

full rationale

The paper constructs EditFHF-15K as a new dataset of human feedback and then defines EditRefiner as a four-agent hierarchical loop (perception saliency maps, diagnostic reasoning, localized re-editing, evaluation-guided iteration). The central claims are empirical outperformance on distortion localization, diagnostic accuracy, and human perception alignment versus prior methods. No equations, fitted parameters, or predictions are presented that reduce by construction to the inputs. The dataset is used for both agent development and evaluation, which is standard ML practice and does not meet the criteria for circularity (no quoted self-definitional reduction or load-bearing self-citation of a uniqueness theorem). The framework is presented as a design choice rather than a derivation that collapses to its own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The framework depends on the assumption that human annotations in EditFHF-15K faithfully capture perceptual failures and that VLMs can be orchestrated into a reliable closed-loop correction process; no free parameters or new physical entities are described.

axioms (2)
  • domain assumption: Human mean opinion scores and region annotations in the constructed dataset accurately reflect perceptual quality, instruction following, and visual consistency.
    The entire training and evaluation of the agents rests on these annotations being reliable ground truth.
  • domain assumption: Vision-language models possess sufficient spatial grounding when guided by the proposed perception and reasoning agents.
    The paper contrasts its approach with prior VLMs that have weak grounding, implying the new agent structure overcomes this limitation.
invented entities (1)
  • Perception-reasoning-action-evaluation agent loop (no independent evidence)
    purpose: to implement human-aligned iterative refinement of image edits
    Newly proposed components whose effectiveness is asserted via the new dataset rather than independent external validation.

pith-pipeline@v0.9.0 · 5669 in / 1508 out tokens · 66600 ms · 2026-05-11T02:16:04.822444+00:00 · methodology



Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 10 internal anchors

  1. [1] Banerjee, S., Lavie, A.: METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization (ACL Workshop). pp. 65–72 (Jun 2005)
  2. [2] Brooks, T., Holynski, A., Efros, A.A.: InstructPix2Pix: Learning to follow image editing instructions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 18392–18402 (2023)
  3. [3] Bruce, N., Tsotsos, J.: Saliency based on information maximization. In: Weiss, Y., Schölkopf, B., Platt, J. (eds.) NIPS. vol. 18. MIT Press (2005), https://proceedings.neurips.cc/paper_files/paper/2005/file/0738069b244a1c43c83112b735140a16-Paper.pdf
  4. [4] Bylinskii, Z., Judd, T., Oliva, A., Torralba, A., Durand, F.: What do different evaluation metrics tell us about saliency models? IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 41(3), 740–757 (2019)
  5. [5] Cao, M., Wang, X., Qi, Z., Shan, Y., Qie, X., Zheng, Y.: MasaCtrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 22560–22570 (2023)
  6. [6] Cerf, M., Harel, J., Einhaeuser, W., Koch, C.: Predicting human gaze using low-level saliency combined with face detection. In: Platt, J., Koller, D., Singer, Y., Roweis, S. (eds.) NIPS. vol. 20. Curran Associates, Inc. (2007), https://proceedings.neurips.cc/paper_files/paper/2007/file/708f3cf8100d5e71834b1db77dfa15d6-Paper.pdf
  7. [7] Chan, C.M., Chen, W., Su, Y., Yu, J., Xue, W., Zhang, S., et al.: ChatEval: Towards better LLM-based evaluators through multi-agent debate. In: Proceedings of the International Conference on Learning Representations (ICLR). pp. 1–9 (2023)
  8. [8] Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)
  9. [9] Cornia, M., Baraldi, L., Serra, G., Cucchiara, R.: A deep multi-level network for saliency prediction. In: ICPR (2016)
  10. [10] Cornia, M., Baraldi, L., Serra, G., Cucchiara, R.: Predicting human eye fixations via an LSTM-based saliency attentive model. IEEE Transactions on Image Processing 27(10), 5142–5154 (2018)
  11. [11] Duan, H., Hu, Q., Wang, J., Yang, L., Xu, Z., Liu, L., Min, X., Cai, C., Ye, T., Zhang, X., Zhai, G.: FineVQ: Fine-grained user generated content video quality assessment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)
  12. [12] Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Proceedings of the International Conference on Machine Learning (ICML) (2024)
  13. [13] Gao, T., Yao, X., Chen, D.: SimCSE: Simple contrastive learning of sentence embeddings. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) (2021)
  14. [14] Goferman, S., Zelnik-Manor, L., Tal, A.: Context-aware saliency detection. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. pp. 2376–2383 (2010). https://doi.org/10.1109/CVPR.2010.5539929
  15. [15] Google DeepMind: Gemini 3.1 Pro: Best for complex tasks and bringing creative concepts to life. https://deepmind.google/models/gemini/pro/ (2025)
  16. [16] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 770–778 (2016)
  17. [17] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., et al.: LoRA: Low-rank adaptation of large language models. In: Proceedings of the International Conference on Learning Representations (ICLR) (2022)
  18. [18] Huang, X., Liu, W., Chen, X., Wang, X., Wang, H., Lian, D., Wang, Y., Tang, R., Chen, E.: Understanding the planning of LLM agents: A survey. arXiv preprint arXiv:2402.02716 (2024)
  19. [19] Huang, X., Shen, C., Boix, X., Zhao, Q.: SALICON: Reducing the semantic gap in saliency prediction by adapting deep neural networks. In: ICCV (December 2015)
  20. [20] Hui, M., Yang, S., Zhao, B., Shi, Y., Wang, H., Wang, P., et al.: HQ-Edit: A high-quality dataset for instruction-based image editing. arXiv preprint arXiv:2404.09990 (2024)
  21. [21] International Telecommunication Union (ITU): Methodology for the subjective assessment of the quality of television pictures. Tech. Rep. Rec. ITU-R BT.500-13 (Jan 2012)
  22. [22] Ju, X., Zeng, A., Bian, Y., Liu, S., Xu, Q.: PnP Inversion: Boosting diffusion-based editing with 3 lines of code. In: Proceedings of the International Conference on Learning Representations (ICLR) (2024)
  23. [23] Kulikov, V., Kleiner, M., Huberman-Spiegelglas, I., Michaeli, T.: FlowEdit: Inversion-free text-based editing using pre-trained flow models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 19721–19730 (2025)
  24. [24] Lao, S., Gong, Y., Shi, S., Yang, S., Wu, T., Wang, J., et al.: Attentions help CNNs see better: Attention-based hybrid image quality assessment network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. pp. 1140–1149 (2022)
  25. [25] Li, H., Zhang, M., Zheng, D., Guo, Z., Jia, Y., Feng, K., et al.: EditThinker: Unlocking iterative reasoning for any image editor. arXiv preprint arXiv:2512.05965 (2025)
  26. [26] Liang, T., He, Z., Jiao, W., Wang, X., Wang, Y., Wang, R., et al.: Encouraging divergent thinking in large language models through multi-agent debate. arXiv preprint arXiv:2305.19118 (2023)
  27. [27] Liang, Y., He, J., Li, G., Li, P., Klimovskiy, A., Carolan, N., Sun, J., Pont-Tuset, J., Young, S., Yang, F., et al.: Rich human feedback for text-to-image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19401–19411 (2024)
  28. [28] Lin, C.Y.: ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out. pp. 74–81. Association for Computational Linguistics, Barcelona, Spain (Jul 2004), https://aclanthology.org/W04-1013/
  29. [29] Liu, L., Cai, C., Shen, S., Liang, J., Ouyang, W., Ye, T., Mao, J., Duan, H., Yao, J., Zhang, X., et al.: MoA-VR: A mixture-of-agents system towards all-in-one video restoration. IEEE Journal of Selected Topics in Signal Processing (2025)
  30. [30] Liu, S., Han, Y., Xing, P., Yin, F., Wang, R., Cheng, W., et al.: Step1X-Edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761 (2025)
  31. [31] Lou, J., Ma, L., Hu, K.X., Yang, H., Lin, W.Y.: TranSalNet: Towards perceptually relevant visual saliency prediction. Neurocomputing 507, 250–264 (2022)
  32. [32] Lu, S., Li, Y., Xia, Y., Hu, Y., Zhao, S., Ma, Y., et al.: Ovis2.5 technical report. arXiv preprint arXiv:2508.11737 (2025)
  33. [33] Luo, X., Wang, J., Wu, C., Xiao, S., Jiang, X., Lian, D., et al.: EditScore: Unlocking online RL for image editing via high-fidelity reward modeling. arXiv preprint arXiv:2509.23909 (2025)
  34. [34] Lupascu, M., Stupariu, M.S.: Optimal transport for rectified flow image editing: Unifying inversion-based and direct methods. arXiv preprint arXiv:2508.02363 (2025)
  35. [35] Mao, C., Zhang, J., Pan, Y., Jiang, Z., Han, Z., Liu, Y., Zhou, J.: ACE++: Instruction-based image creation and editing via context-aware content filling. arXiv preprint arXiv:2501.02487 (2025)
  36. [36] Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
  37. [37] Nam, H., Kwon, G., Park, G.Y., Ye, J.C.: Contrastive denoising score for text-guided latent diffusion image editing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 9192–9201 (June 2024)
  38. [38] OpenAI: ChatGPT 5. https://openai.com/gpt-5/ (2025)
  39. [39] OpenAI: GPT-5. https://www.openai.com (2025)
  40. [40] Qian, Y., Bocek-Rivele, E., Song, L., Tong, J., Yang, Y., Lu, J., et al.: Pico-Banana-400K: A large-scale dataset for text-guided image editing. arXiv preprint arXiv:2510.19808 (2025)
  41. [41] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10684–10695 (2022)
  42. [42] Seedream, T., Chen, Y., Gao, Y., Gong, L., Guo, M., Guo, Q., et al.: Seedream 4.0: Toward next-generation multimodal image generation. arXiv preprint (2025)
  43. [43] Sheikh, H.R., Bovik, A.C.: Image information and visual quality. IEEE Transactions on Image Processing (TIP) 15(2), 430–444 (2006)
  44. [44] Shen, S., Liang, J., Cai, C., Geng, C., Duan, H., Zhang, X., et al.: Agentic retoucher for text-to-image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2026)
  45. [45] Wang, J., Duan, H., Zhai, G., Min, X.: Quality assessment for AI generated images with instruction tuning. IEEE Transactions on Multimedia (TMM) (2026)
  46. [46] Wang, J., Duan, H., Zhao, Y., Wang, J., Zhai, G., Min, X.: LMM4LMM: Benchmarking and evaluating large-multimodal image generation with LMMs. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 17312–17323 (2025)
  47. [47] Wang, J., Wang, J., Duan, H., Kang, J., Zhai, G., Min, X.: I2I-Bench: A comprehensive benchmark suite for image-to-image editing models. arXiv preprint arXiv:2512.04660 (2025)
  48. [48] Wang, L., Xing, X., Cheng, Y., Zhao, Z., Donghao, L., Tiankai, H., et al.: PromptEnhancer: A simple approach to enhance text-to-image models via chain-of-thought prompt rewriting. arXiv preprint arXiv:2509.04545 (2025)
  49. [49] Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., et al.: InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025)
  50. [50] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing (TIP) 13(4), 600–612 (2004)
  51. [51] Wei, H., Liu, H., Wang, Z., Peng, Y., Xu, B., Wu, S., et al.: Skywork UniPic 3.0: Unified multi-image composition via sequence modeling. arXiv preprint arXiv:2601.15664 (2026)
  52. [52] Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., et al.: Qwen-Image technical report. arXiv preprint arXiv:2508.02324 (2025)
  53. [53] Wu, C., Zheng, P., Yan, R., Xiao, S., Luo, X., Wang, Y., et al.: OmniGen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871 (2025)
  54. [54] Wu, H., Zhang, Z., Zhang, W., Chen, C., Li, C., Liao, L., et al.: Q-Align: Teaching LMMs for visual scoring via discrete text-defined levels. arXiv preprint arXiv:2312.17090 (2023)
  55. [55] Wu, K., Jiang, S., Ku, M., Nie, P., Liu, M., Chen, W.: EditReward: A human-aligned reward model for instruction-guided image editing. arXiv preprint arXiv:2509.26346 (2025)
  56. [56] Wu, Q., Bansal, G., Zhang, J., Wu, Y., Li, B., Zhu, E., et al.: AutoGen: Enabling next-gen LLM applications via multi-agent conversation. arXiv preprint arXiv:2308.08155 (2023)
  57. [58] Wu, Y., Li, Z., Hu, X., Ye, X., Zeng, X., Yu, G., et al.: KRIS-Bench: Benchmarking next-level intelligent image editing models. arXiv preprint arXiv:2505.16707 (2025)
  58. [59] Xia, B., Peng, B., Zhang, Y., Huang, J., Liu, J., Li, J., et al.: DreamOmni2: Multimodal instruction-based editing and generation. arXiv preprint arXiv:2510.06679 (2025)
  59. [60] Xu, Z., Duan, H., Ji, Z., Zhang, X., Liu, Y., Min, X., et al.: EditHF-1M: A million-scale rich human preference feedback for image editing. arXiv preprint arXiv:2603.14916 (2026)
  60. [61] Xu, Z., Duan, H., Liu, B., Ma, G., Wang, J., Yang, L., et al.: LMM4Edit: Benchmarking and evaluating multimodal image editing with LMMs. In: Proceedings of the ACM International Conference on Multimedia (ACM MM). pp. 6908–6917 (October 2025)
  61. [62] Xu, Z., Duan, H., Ma, G., Yang, L., Wang, J., Wu, Q., et al.: HarmonyIQA: Pioneering benchmark and model for image harmonization quality assessment. In: IEEE International Conference on Multimedia and Expo (ICME). pp. 1–6 (2025)
  62. [63] Xu, Z., Shen, D., Du, Y., Hao, K., Huang, J., Huang, X.: MagicWand: A universal agent for generation and evaluation aligned with user preference (2025)
  63. [64] Xue, W., Zhang, L., Mou, X., Bovik, A.C.: Gradient magnitude similarity deviation: A highly efficient perceptual image quality index. IEEE Transactions on Image Processing (TIP) 23(2), 684–695 (2013)
  64. [65] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)
  65. [66] Yang, L., Duan, H., Wang, J., Liu, J., Hu, M., Min, X., Zhai, G., Le Callet, P.: Quality assessment and distortion-aware saliency prediction for AI-generated omnidirectional images. IEEE Transactions on Circuits and Systems for Video Technology (2025)
  66. [67] Ye, R., Zhang, J., Liu, Z., Zhu, Z., Yang, S., Li, L., et al.: Agent Banana: High-fidelity image editing with agentic thinking and tooling. arXiv preprint arXiv:2602.09084 (2026)
  67. [68] Yin, G., Wang, W., Yuan, Z., Han, C., Ji, W., Sun, S., et al.: Content-variant reference image quality assessment via knowledge distillation. In: Proceedings of the Conference on Association for the Advancement of Artificial Intelligence (AAAI). vol. 36, pp. 3134–3142 (2022)
  68. [69] Zhang, K., Mo, L., Chen, W., Sun, H., Su, Y.: MagicBrush: A manually annotated dataset for instruction-guided image editing. In: Proceedings of the Advances in Neural Information Processing Systems (NeurIPS) (2023)
  69. [70] Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
  70. [71] Zhao, Q., Cai, J.: Visual saliency detection by spatially weighted dissimilarity. In: 2011 IEEE CVPR. pp. 1241–1248. IEEE (2011)
  71. [72] Zhu, K., Gu, J., You, Z., Qiao, Y., Dong, C.: An intelligent agentic system for complex image restoration problems. arXiv preprint arXiv:2410.17809 (2024)