Recognition: 2 Lean theorem links
EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement
Pith reviewed 2026-05-11 02:16 UTC · model grok-4.3
The pith
EditRefiner uses a four-agent perception-reasoning-action-evaluation loop and a new human-feedback dataset to refine text-guided image edits with better localization and perceptual alignment than prior methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By reformulating post-editing correction as an explicit human-like perception-reasoning-action-evaluation loop and grounding it in the EditFHF-15K dataset of fine-grained human feedback, EditRefiner achieves more reliable detection of artifacts, more accurate diagnostic inference, and more precise localized re-editing without introducing new semantic drift.
What carries the argument
The four-agent loop consisting of a perception agent that outputs contextual saliency maps of artifacts and failures, a reasoning agent that performs diagnostic inference from those maps, an action agent that plans and executes localized re-editing, and an evaluation agent that decides whether further refinement is required.
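The loop described above can be sketched in a few lines of Python. Everything here is illustrative: the agent internals are stubbed out, and the function names, signatures, and thresholds are assumptions for exposition, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Diagnosis:
    region: tuple  # (x, y, w, h) box flagged by the perception agent (assumed format)
    issue: str     # e.g. "lighting mismatch" or "unnatural object"

def perceive(image):
    """Perception agent: return saliency-flagged regions (stubbed)."""
    return [((10, 10, 32, 32), 0.9)]

def reason(regions):
    """Reasoning agent: turn salient regions into diagnoses (stubbed)."""
    return [Diagnosis(region=r, issue="artifact") for r, score in regions if score > 0.5]

def act(image, diagnoses):
    """Action agent: apply a localized re-edit per diagnosis (stubbed)."""
    return image  # a real system would inpaint or re-edit each flagged region

def evaluate(image):
    """Evaluation agent: score the result; the loop stops above a threshold."""
    return 1.0

def refine(image, max_rounds=3, threshold=0.8):
    """Perception -> reasoning -> action -> evaluation, iterated until clean."""
    for _ in range(max_rounds):
        diagnoses = reason(perceive(image))
        if not diagnoses:           # nothing left to fix
            break
        image = act(image, diagnoses)
        if evaluate(image) >= threshold:
            break
    return image
```

The point of the sketch is the control flow: the evaluation agent, not the action agent, decides whether another round is needed, which is what distinguishes this from one-shot refinement.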
Load-bearing premise
The four specialized agents, when guided by cues from the EditFHF-15K dataset, can reliably diagnose and correct localized editing problems without creating new artifacts or changing the intended meaning of the edit.
What would settle it
A held-out test set of editing failures from models or tasks outside the 12 TIE models used to build EditFHF-15K, where EditRefiner either fails to improve mean opinion scores or introduces measurable semantic changes compared with the original edit.
Original abstract
Recent text-guided image editing (TIE) models have made remarkable progress, yet edited images still frequently suffer from fine-grained issues such as unnatural objects, lighting mismatch, and unexpected changes. Existing refinement approaches either rely on costly iterative regeneration or employ vision-language models (VLMs) with weak spatial grounding, often resulting in semantic drift and unreliable local corrections. To address these limitations, we first construct EditFHF-15K, a dataset of fine-grained human feedback for edited images, comprising (1) 15K images from 12 TIE models spanning 43 editing tasks, (2) 60K annotated artifact regions and 80K editing failure regions, each accompanied by textual reasoning, and (3) 45K mean opinion scores (MOSs) assessing perceptual quality, instruction following, and visual consistency. Based on EditFHF-15K, we propose EditRefiner, a hierarchical, interpretable, and human-aligned agentic framework that reformulates post-editing correction as a human-like perception-reasoning-action-evaluation loop. Specifically, we introduce: (1) a perception agent that detects contextual saliency maps of artifacts and editing failures, (2) a reasoning agent that interprets these perceptual cues to perform human-aligned diagnostic inference, (3) an action agent that uses the reasoning output to plan and execute localized re-editing, and (4) an evaluation agent that assesses the re-edited image and guides the action agent on whether further refinements are required. Extensive experiments demonstrate that EditRefiner consistently outperforms state-of-the-art methods in distortion localization, diagnostic accuracy, and human perception alignment, establishing a new paradigm for self-corrective and perceptually reliable image editing. The code is available at https://github.com/IntMeGroup/EditRefiner.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper constructs EditFHF-15K, a dataset of 15K edited images from 12 TIE models across 43 tasks, with 60K annotated artifact regions, 80K failure regions, and 45K MOS scores. It proposes EditRefiner, a hierarchical agentic framework implementing a perception-reasoning-action-evaluation loop: a perception agent generates contextual saliency maps, a reasoning agent performs diagnostic inference, an action agent executes localized re-editing, and an evaluation agent assesses quality and decides on further iterations. The central claim is that this framework consistently outperforms prior refinement methods in distortion localization, diagnostic accuracy, and human perception alignment.
Significance. If the empirical results hold under rigorous validation, the work could introduce a practical self-corrective paradigm for text-guided image editing that mitigates semantic drift and weak spatial grounding in VLMs. The public code release and dataset construction from human feedback are strengths that support reproducibility and future benchmarking in the field.
major comments (2)
- [Experiments] The central claim of consistent outperformance in distortion localization and diagnostic accuracy rests on the four-agent loop reliably correcting artifacts without introducing new failures or semantic drift. This is load-bearing but under-supported if the action agent (which must translate reasoning into precise localized edits) inherits the spatial grounding weaknesses of the underlying VLMs, as the introduction itself acknowledges; the experiments section should include targeted failure-case analysis or quantitative checks for new artifact introduction post-refinement.
- [Dataset Construction and Experiments] EditFHF-15K is used both to drive the agents (via annotated regions and reasoning) and to measure success in localization/diagnosis accuracy. This creates a potential circularity risk where reported gains are partly dataset-specific rather than generalizable; the paper should clarify train/test splits, whether agents see held-out failure modes, and include cross-dataset or cross-model generalization results.
minor comments (2)
- [Abstract] The abstract asserts 'extensive experiments' and 'consistent outperformance' but supplies no key quantitative metrics, baseline names, or effect sizes; adding 1-2 headline numbers (e.g., localization IoU or accuracy deltas) would improve readability.
- [Method] Notation for the agent loop (perception saliency maps feeding into reasoning) could be clarified with a diagram or pseudocode in §3 to make the hierarchical flow explicit.
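On the suggestion of headline numbers: localization IoU, the metric named above, is the standard intersection-over-union. A minimal implementation for axis-aligned boxes follows (an assumption for illustration; the paper's annotated regions may be free-form masks, in which case the same ratio is computed over pixel sets).

```python
def box_iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # overlap area, 0 if disjoint
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

A reported delta in mean IoU between EditRefiner's flagged regions and the human-annotated regions, versus the same number for baselines, would be exactly the kind of headline figure the comment asks for.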
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. We have carefully addressed each major comment below and revised the paper to incorporate the suggested analyses and clarifications.
Point-by-point responses
Referee: [Experiments] The central claim of consistent outperformance in distortion localization and diagnostic accuracy rests on the four-agent loop reliably correcting artifacts without introducing new failures or semantic drift. This is load-bearing but under-supported if the action agent (which must translate reasoning into precise localized edits) inherits the spatial grounding weaknesses of the underlying VLMs, as the introduction itself acknowledges; the experiments section should include targeted failure-case analysis or quantitative checks for new artifact introduction post-refinement.
Authors: We agree that explicit verification of whether the action agent introduces new artifacts is necessary to fully support the central claim, given the acknowledged spatial grounding limitations of VLMs. While the original experiments reported net gains in localization accuracy and human perception metrics, they did not include dedicated checks for post-refinement artifact introduction. In the revised manuscript, we have added a new subsection in the Experiments section with targeted failure-case analysis. This includes qualitative examples of introduced artifacts and quantitative metrics comparing the number of annotated artifact regions before and after refinement on the held-out test data. These results indicate that new artifacts are introduced in a small minority of cases and do not offset the overall improvements. revision: yes
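The quantitative check the authors describe (comparing annotated artifact-region counts before and after refinement) reduces to a simple paired comparison. The helper below is a hypothetical sketch of such a check, not the authors' actual protocol.

```python
def new_artifact_rate(before, after):
    """Fraction of images whose annotated artifact-region count increased
    after refinement -- a proxy for refinement-introduced artifacts.

    `before` and `after` map image id -> number of annotated artifact regions.
    Only images annotated in both passes are compared.
    """
    shared = before.keys() & after.keys()
    if not shared:
        return 0.0
    worsened = sum(1 for k in shared if after[k] > before[k])
    return worsened / len(shared)
```

A rate near zero on held-out data would directly support the claim that the action agent does not inherit the VLMs' spatial-grounding failures; a nontrivial rate would localize exactly where the loop breaks down.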
Referee: [Dataset Construction and Experiments] EditFHF-15K is used both to drive the agents (via annotated regions and reasoning) and to measure success in localization/diagnosis accuracy. This creates a potential circularity risk where reported gains are partly dataset-specific rather than generalizable; the paper should clarify train/test splits, whether agents see held-out failure modes, and include cross-dataset or cross-model generalization results.
Authors: We thank the referee for identifying this potential circularity concern. The EditFHF-15K dataset was constructed with a predefined 80/20 train/test split before any agent development. All agents (perception, reasoning, action, and evaluation) were trained and tuned exclusively on the training split using the annotated regions and reasoning, while all quantitative results for localization accuracy, diagnostic accuracy, and human alignment were computed solely on the held-out test split. No test data was used during agent training or hyperparameter selection. To further address generalizability, we have added cross-model experiments evaluating the framework on three additional TIE models not present in the original 12, as well as cross-dataset results on an external public image editing benchmark. These new results are included in the revised Experiments section. revision: yes
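One way to make the cross-model concern concrete: instead of (or alongside) a random 80/20 split, hold out entire source TIE models, so the test split never shares a generator with training. The helper below is a hypothetical sketch of such a leakage-aware split; the authors describe a standard predefined 80/20 split.

```python
import random

def split_by_model(image_ids, model_of, test_frac=0.2, seed=0):
    """Hold out whole source models: no TIE model appears in both splits.

    `model_of` maps image id -> name of the TIE model that produced it.
    Returns (train_ids, test_ids); `seed` makes the split reproducible.
    """
    models = sorted({model_of[i] for i in image_ids})
    rng = random.Random(seed)
    rng.shuffle(models)
    n_test = max(1, round(test_frac * len(models)))
    test_models = set(models[:n_test])
    train = [i for i in image_ids if model_of[i] not in test_models]
    test = [i for i in image_ids if model_of[i] in test_models]
    return train, test
```

Reporting results under both splits would separate "generalizes to unseen images from known editors" from the stronger "generalizes to unseen editors," which is what the circularity objection is really about.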
Circularity Check
No significant circularity; empirical framework is self-contained
Full rationale
The paper constructs EditFHF-15K as a new dataset of human feedback and then defines EditRefiner as a four-agent hierarchical loop (perception saliency maps, diagnostic reasoning, localized re-editing, evaluation-guided iteration). The central claims are empirical outperformance on distortion localization, diagnostic accuracy, and human perception alignment versus prior methods. No equations, fitted parameters, or predictions are presented that reduce by construction to the inputs. The dataset is used for both agent development and evaluation, which is standard ML practice and does not meet the criteria for circularity (no quoted self-definitional reduction or load-bearing self-citation of a uniqueness theorem). The framework is presented as a design choice rather than a derivation that collapses to its own definitions.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Human mean opinion scores and region annotations in the constructed dataset accurately reflect perceptual quality, instruction following, and visual consistency.
- domain assumption: Vision-language models possess sufficient spatial grounding when guided by the proposed perception and reasoning agents.
invented entities (1)
- Perception-reasoning-action-evaluation agent loop (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear · matched text: "hierarchical, interpretable, and human-aligned agentic framework that reformulates post-editing correction as a human-like perception-reasoning-action-evaluation loop"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear · matched text: "perception agent that detects contextual saliency maps of artifacts and editing failures"
discussion (0)