DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model

Boxian Ai; Chenhao Huang; Fengjiao Chen; Jun Kuang; Shibo Hong; Wei Wang; Yixin Cao; Zhongyuan Peng

arxiv: 2602.23622 · v2 · pith:FQPFOYWAnew · submitted 2026-02-27 · 💻 cs.CV · cs.AI

Shibo Hong, Boxian Ai, Jun Kuang, Wei Wang, FengJiao Chen, Zhongyuan Peng, Chenhao Huang, Yixin Cao

Pith reviewed 2026-05-21 12:00 UTC · model grok-4.3

keywords image editing · instruction-based models · small-scale objects · benchmark evaluation · visual consistency · instruction following · IIEMs

The pith

DLEBench reveals that instruction-based image editing models struggle with small objects occupying just 1 to 10 percent of an image area.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DeepLookEditBench as the first dedicated benchmark for testing how well instruction-based image editing models handle small-scale objects. It builds a test set of 1889 samples across seven instruction types, with target objects kept deliberately small and including cases of occlusion and multiple objects. The authors also create a dual-mode evaluation protocol using Tool-driven and Oracle-guided Modes plus refined scoring rubrics for Instruction Following and Visual Consistency. When ten existing models are run on this benchmark, clear performance shortfalls appear in small-object editing.

Core claim

DLEBench is a benchmark of 1889 samples designed to evaluate small-scale object editing in instruction-based image editing models, supported by an evaluation protocol that uses Tool-driven and Oracle-guided Modes together with refined rubrics to reduce subjectivity in judging instruction adherence and visual consistency.
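The small-scale constraint (target objects covering 1% to 10% of the image area) can be illustrated with a minimal sketch. It assumes the target is described by an axis-aligned bounding box; the paper may measure area differently (for example with segmentation masks), and the names `Box`, `area_ratio`, and `is_small_scale` are hypothetical helpers, not part of DLEBench.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixels (hypothetical helper, not from the paper)."""
    x_min: int
    y_min: int
    x_max: int
    y_max: int

    def area(self) -> int:
        # Degenerate or inverted boxes count as zero area.
        return max(0, self.x_max - self.x_min) * max(0, self.y_max - self.y_min)


def area_ratio(box: Box, image_width: int, image_height: int) -> float:
    """Fraction of the image area covered by the box."""
    if image_width <= 0 or image_height <= 0:
        raise ValueError("image dimensions must be positive")
    return box.area() / (image_width * image_height)


def is_small_scale(box: Box, image_width: int, image_height: int,
                   low: float = 0.01, high: float = 0.10) -> bool:
    """True when the box covers between `low` and `high` of the image.

    The 1%-10% defaults follow the range stated in the abstract; treating
    both bounds as inclusive is an assumption.
    """
    return low <= area_ratio(box, image_width, image_height) <= high
```

For instance, a 250 x 200 pixel box in a 1000 x 1000 image covers 5% of the area and would pass this filter, while a 400 x 400 box (16%) would not.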

What carries the argument

The dual-mode evaluation framework consisting of Tool-driven and Oracle-guided Modes, which corrects misalignment between large multimodal model judges and human judgments on small-object edits.

If this is right

Existing instruction-based image editing models exhibit significant performance gaps when required to edit small objects.
Specialized benchmarks focused on small-scale edits are necessary to drive progress in precise local editing.
Complex scenarios such as partial occlusion and multi-object editing expose particular weaknesses in current models.
Refined rubrics and dual-mode judging improve agreement between automated scores and human assessments on this task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Better small-object editing could support finer detail refinement in both AI-generated images and everyday photo editing workflows.
Training regimes that deliberately include small-scale edit examples might help close the gaps shown on DLEBench.
The benchmark structure could be reused to test editing of even smaller regions or additional instruction categories.

Load-bearing premise

The proposed evaluation protocol with refined score rubrics and dual-mode framework successfully minimizes subjectivity and aligns with human judgments on the constructed samples.

What would settle it

A side-by-side test in which human raters score the same set of small-object edits and the scores diverge substantially from those produced by the Tool-driven and Oracle-guided Modes would falsify the claim that the protocol aligns with human judgment.
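One way to quantify such a divergence is a rank correlation between human scores and protocol scores on the same edits. The sketch below is a plain-Python Spearman correlation with average ranks for ties; it is an illustration of the kind of check described, not a procedure taken from the paper, and all function names are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def spearman(human_scores, protocol_scores):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(human_scores) != len(protocol_scores) or len(human_scores) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(human_scores), average_ranks(protocol_scores))
```

A value near 1 would indicate that the protocol orders edits much as human raters do; values near 0 or below would support the falsification scenario described above.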

Original abstract

Significant progress has been made in the field of Instruction-based Image Editing Models (IIEMs). However, while these models demonstrate plausible adherence to instructions and strong reasoning ability on current benchmarks, their ability to edit small objects remains underexplored, despite its importance for precise local editing and refining details in both real and generated images. In this paper, we introduce DeepLookEditBench (DLEBench), the first benchmark dedicated to assessing the abilities of IIEMs in editing small-scale objects. Specifically, we construct a challenging testbed comprising 1889 samples across seven instruction types. In these samples, target objects occupy only 1%-10% of the image area, covering complex scenarios such as partial occlusion and multi-object editing. To ensure robust evaluation on this benchmark, we propose an evaluation protocol with refined score rubrics to minimize subjectivity and ambiguity in two criteria: Instruction Following and Visual Consistency. This protocol also introduces a dual-mode evaluation framework (Tool-driven and Oracle-guided Modes) addressing the misalignment between LMM-as-a-Judge and human judgements on DLEBench. Empirical results on 10 IIEMs reveal significant performance gaps in small-scale object editing, highlighting the need for specialized benchmarks to advance this ability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

DLEBench targets a genuine gap in small-object editing for image models with a new 1889-sample testbed, but the dual-mode protocol's human alignment rests on thin validation.

Hi, the main point is that this paper introduces DLEBench as the first benchmark aimed squarely at small-scale object edits in instruction-based image editing models. Objects are restricted to 1-10% of the image area across 1889 samples and seven instruction types, with added complexity from occlusions and multi-object cases. That focus is new relative to broader existing benchmarks that mostly test larger changes. The construction looks concrete and the authors lay out clear criteria for the samples, which is a practical step forward for measuring local precision that matters in real editing workflows.

They also propose refined rubrics for instruction following and visual consistency plus a dual-mode setup with tool-driven and oracle-guided evaluation to reduce reliance on LMM judges alone. The results across ten models indicate clear performance shortfalls on these small edits, which aligns with the motivation.

The softer part is the protocol's claimed alignment with human judgments. The description suggests a human study was done to address subjectivity, yet it appears limited without the usual quantitative checks such as correlation scores or inter-rater agreement on a solid subset of samples. If those numbers are absent or modest, the reported gaps become harder to attribute cleanly to model limits rather than evaluator choices.

This work is mainly for computer vision researchers who build or evaluate image editing systems and need diagnostics for fine-grained local changes. A reader looking for new testbeds or evaluation ideas would get direct value from the sample details and protocol. It is worth sending to peer review because a specialized benchmark like this can help the field even if the scoring validation needs tightening.

Referee Report

1 major / 1 minor

Summary. The paper introduces DLEBench, the first benchmark dedicated to small-scale object editing in Instruction-based Image Editing Models (IIEMs). It comprises 1889 samples across seven instruction types, with target objects occupying 1-10% of the image area and covering scenarios such as partial occlusion and multi-object editing. The authors propose an evaluation protocol with refined score rubrics for Instruction Following and Visual Consistency, plus a dual-mode framework (Tool-driven and Oracle-guided Modes) intended to reduce subjectivity and correct LMM-as-a-Judge misalignment with human judgments. Empirical evaluation of 10 IIEMs reveals significant performance gaps, motivating the need for specialized benchmarks.

Significance. If the evaluation protocol is shown to be reliable, the benchmark would usefully highlight underexplored limitations in current IIEMs for precise local edits on small objects, potentially guiding targeted improvements in model architectures and training for fine-grained image manipulation tasks.

major comments (1)

The dual-mode evaluation framework is described as successfully minimizing subjectivity and aligning with human judgments on DLEBench. However, the manuscript reports no quantitative validation metrics (e.g., correlation coefficients, Cohen's kappa, or agreement rates) between protocol scores and human annotations on any subset of the 1889 samples. This is load-bearing for the central claim, because the reported performance gaps across the 10 IIEMs cannot be confidently attributed to model limitations without evidence that the rubrics and modes produce scores consistent with human consensus on Instruction Following and Visual Consistency.

minor comments (1)

The abstract and methods description would benefit from a brief table or breakdown showing the number of samples per instruction type and per scenario (e.g., occlusion vs. multi-object) to clarify the benchmark composition.
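The breakdown requested here could be produced with a simple tally. The sketch below assumes each sample is a record with `instruction_type` and `scenario` fields; those field names, and the placeholder labels in the usage note, are hypothetical, since the seven instruction types are not named in this summary.

```python
from collections import Counter


def composition_counts(samples):
    """Count samples per (instruction_type, scenario) pair."""
    return Counter((s["instruction_type"], s["scenario"]) for s in samples)


def format_composition(counts):
    """Render the counts as a plain-text table, sorted by type then scenario."""
    rows = ["instruction_type | scenario | count"]
    for (instruction_type, scenario), n in sorted(counts.items()):
        rows.append(f"{instruction_type} | {scenario} | {n}")
    return "\n".join(rows)
```

Called on records such as `{"instruction_type": "type_a", "scenario": "occlusion"}`, `format_composition(composition_counts(samples))` yields one row per combination, which is the kind of table the comment asks for.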

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment point by point below and commit to revisions that strengthen the empirical validation of the proposed evaluation protocol.

Referee: The dual-mode evaluation framework is described as successfully minimizing subjectivity and aligning with human judgments on DLEBench. However, the manuscript reports no quantitative validation metrics (e.g., correlation coefficients, Cohen's kappa, or agreement rates) between protocol scores and human annotations on any subset of the 1889 samples. This is load-bearing for the central claim, because the reported performance gaps across the 10 IIEMs cannot be confidently attributed to model limitations without evidence that the rubrics and modes produce scores consistent with human consensus on Instruction Following and Visual Consistency.

Authors: We agree that the absence of quantitative validation metrics represents a substantive gap in supporting the central claim. The manuscript describes the design of the refined rubrics for Instruction Following and Visual Consistency together with the dual-mode (Tool-driven and Oracle-guided) framework intended to reduce LMM misalignment, yet it does not report correlation coefficients, Cohen's kappa, or agreement rates against human annotations on any subset of the 1889 samples. This omission limits the strength of the argument that observed performance differences across the ten IIEMs can be confidently attributed to model limitations rather than evaluation artifacts. In the revised manuscript we will add a dedicated human validation study performed on a stratified subset of samples. We will report inter-annotator agreement (Cohen's kappa), correlation between protocol scores and mean human ratings (Pearson/Spearman), and raw agreement percentages separately for each criterion and mode. These results will be presented in a new subsection and table, directly addressing the load-bearing concern raised.

revision: yes
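The agreement statistics named in this response can be sketched in a few lines. The code below computes raw agreement and Cohen's kappa for two raters who assign categorical scores to the same items; it is a generic textbook formulation, not the authors' analysis code, and the function names are hypothetical.

```python
from collections import Counter


def raw_agreement(rater_a, rater_b):
    """Fraction of items on which the two raters give the same label."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label sequences of equal length")
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e).

    p_o is the observed agreement; p_e is the agreement expected by chance
    from each rater's marginal label frequencies.
    """
    n = len(rater_a)
    p_o = raw_agreement(rater_a, rater_b)
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = count_a.keys() | count_b.keys()
    p_e = sum(count_a[label] * count_b[label] for label in labels) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)
```

Kappa corrects raw agreement for chance, so reporting both (as the response proposes) shows whether a high agreement percentage merely reflects a skewed score distribution.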

Circularity Check

0 steps flagged

No circularity: direct benchmark construction and external model evaluation

The paper constructs DLEBench (1889 samples across seven instruction types with small objects occupying 1-10% area) and proposes an evaluation protocol with refined rubrics plus dual-mode (Tool-driven/Oracle-guided) framework to address LMM misalignment. No mathematical derivations, fitted parameters, self-referential predictions, or load-bearing self-citations appear in the provided text. Results consist of empirical testing on 10 external IIEMs, with the protocol presented as a methodological contribution rather than a derived quantity that reduces to its own inputs by construction. This is a standard benchmark paper whose central claims rest on sample curation and human-aligned scoring rather than any closed loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark paper. No free parameters, mathematical axioms, or invented entities are introduced; the contribution rests on manual sample curation and rubric design rather than derivations or new theoretical constructs.


[1] [1]

Unireal: Universal image generation and editing via learning real-world dynamics

Xi Chen, Zhifei Zhang, He Zhang, Yuqian Zhou, Soo Ye Kim, Qing Liu, Yĳun Li, Jianming Zhang, Nanxuan Zhao, Yilin Wang, et al. Unireal: Universal image generation and editing via learning real-world dynamics. InProceedings ofthe ComputerVisionandPatternRecognitionConference, pages 12501–12511, 2025

work page 2025

[2] [2]

Diffedit: Diffusion-based seman- tic image editing with mask guidance.arXiv preprint arXiv:2210.11427, 2022

GuillaumeCouairon, JakobVerbeek, HolgerSchwenk, andMatthieuCord. Diffedit: Diffusion-basedsemanticimage editing with mask guidance.http://arxiv.org/abs/2210.11427, 2022. doi: 10.48550/arxiv.2210.11427

work page doi:10.48550/arxiv.2210.11427 2022

[3] [3]

Chatedit: Towards multi-turn interactive facial image editing via dialogue

Xing Cui, Zekun Li, Pei Li, Yibo Hu, Hailin Shi, Chunshui Cao, and Zhaofeng He. Chatedit: Towards multi-turn interactive facial image editing via dialogue. InProceedings of the 2023 Conference on Empirical Methods in NaturalLanguageProcessing, pages 14567–14583, 2023

work page 2023

[4] [4]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprintarXiv:2505.14683, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

Guidinginstruction-basedimage editing via multimodal large language models.arXiv preprintarXiv:2309.17102, 2023

Tsu-JuiFu,WenzeHu,XianzhiDu,WilliamYangWang,YinfeiYang,andZheGan. Guidinginstruction-basedimage editing via multimodal large language models.arXiv preprintarXiv:2309.17102, 2023

work page arXiv 2023

[6] [6]

Instructdiffusion: Ageneralistmodelinginterfaceforvisiontasks

Zigang Geng, Binxin Yang, Tiankai Hang, Chen Li, Shuyang Gu, Ting Zhang, Jianmin Bao, Zheng Zhang, Houqiang Li,HanHu,etal. Instructdiffusion: Ageneralistmodelinginterfaceforvisiontasks. In ProceedingsoftheIEEE/CVF Conferenceon computervision and pattern recognition, pages 12709–12720, 2024

work page 2024

[7] [7]

Mask-Guided Portrait Editing with Conditional GANs

Shuyang Gu, Jianmin Bao, Hao Yang, Dong Chen, Fang Wen, and Lu Yuan. Mask-guided portrait editing with conditional gans.http://arxiv.org/abs/1905.10346, 2022. doi: 10.48550/arxiv.1905.10346

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1905.10346 1905

[8] [8]

Unireditbench: A unified reasoning-based image editing benchmark.arXiv preprint arXiv:2511.01295, 2025

Feng Han, Yibin Wang, Chenglin Li, Zheming Liang, Dianyi Wang, Yang Jiao, Zhipeng Wei, Chao Gong, Cheng Jin, Jingjing Chen, et al. Unireditbench: A unified reasoning-based image editing benchmark.arXiv preprint arXiv:2511.01295, 2025

work page arXiv 2025

[9] [9]

Frabench and ufeval: Unified fine-grained evaluation with task and aspect generalization, 2025

Shibo Hong, Jiahao Ying, Haiyuan Liang, Mengdi Zhang, Jun Kuang, Jiazheng Zhang, and Yixin Cao. Frabench and ufeval: Unified fine-grained evaluation with task and aspect generalization, 2025. URLhttps://arxiv.org/abs/ 2505.12795

work page arXiv 2025

[10] [10]

Paralleledits: Efficient multi-aspect text-driven image editing with attention grouping

Mingzhen Huang, Jialing Cai, Shan Jia, Vishnu Lokhande, and Siwei Lyu. Paralleledits: Efficient multi-aspect text-driven image editing with attention grouping. InAdvancesinNeuralInformationProcessingSystems, 2024

work page 2024

[11] [11]

Instruction-based image editing with planning, reasoning, and generation

Liya Ji, Chenyang Qi, and Qifeng Chen. Instruction-based image editing with planning, reasoning, and generation. In Proceedings ofthe IEEE/CVF International ConferenceonComputerVision, pages 17506–17515, 2025

work page 2025

[12] [12]

Yolov13: Real-time object detection with hypergraph-enhanced adaptive visual perception.arXiv preprint arXiv:2506.17733, 2025

Mengqi Lei, Siqi Li, Yihong Wu, Han Hu, You Zhou, Xinhu Zheng, Guiguang Ding, Shaoyi Du, Zongze Wu, and Yue Gao. Yolov13: Real-time object detection with hypergraph-enhanced adaptive visual perception.arXiv preprint arXiv:2506.17733, 2025

work page arXiv 2025

[13] [13]

Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback

Zongjian Li, Zheyuan Liu, Qihui Zhang, Bin Lin, Feize Wu, Shenghai Yuan, Zhiyuan Yan, Yang Ye, Wangbo Yu, Yuwei Niu, et al. Uniworld-v2: Reinforce image editing with diffusion negative-aware finetuning and mllm implicit feedback. arXiv preprintarXiv:2510.16888, 2025

work page internal anchor Pith review arXiv 2025

[14] [14]

UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, et al. Uniworld: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprintarXiv:2506.03147, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[15] [15]

Step1X-Edit: A Practical Framework for General Image Editing

Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. Step1x-edit: A practical framework for general image editing.arXivpreprintarXiv:2504.17761, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[16] [16]

I2ebench: A comprehensive benchmark for instruction-based image editing.Advancesin NeuralInformation ProcessingSystems, 37:41494–41516, 2024

Yiwei Ma, Jiayi Ji, Ke Ye, Weihuang Lin, Zhibin Wang, Yonghan Zheng, Qiang Zhou, Xiaoshuai Sun, and Rongrong Ji. I2ebench: A comprehensive benchmark for instruction-based image editing.Advancesin NeuralInformation ProcessingSystems, 37:41494–41516, 2024

work page 2024

[17] [17]

Gie-bench: Towards grounded evaluation for text-guided image editing.arXiv preprintarXiv:2505.11493, 2025

Yusu Qian, Jiasen Lu, Tsu-Jui Fu, Xinze Wang, Chen Chen, Yinfei Yang, Wenze Hu, and Zhe Gan. Gie-bench: Towards grounded evaluation for text-guided image editing.arXiv preprintarXiv:2505.11493, 2025. 48

work page arXiv 2025

[18] [18]

Grounding dino 1.5: Advance the" edge" of open-set object detection.arXiv preprint arXiv:2405.10300, 2024

Tianhe Ren, Qing Jiang, Shilong Liu, Zhaoyang Zeng, Wenlong Liu, Han Gao, Hongjie Huang, Zhengyu Ma, Xiaoke Jiang, Yihao Chen, et al. Grounding dino 1.5: Advance the" edge" of open-set object detection.arXiv preprint arXiv:2405.10300, 2024

work page arXiv 2024

[19] [19]

Emu edit: Precise image editing via recognition and generation tasks

Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu edit: Precise image editing via recognition and generation tasks. InProceedings ofthe IEEE/CVF Conferenceon ComputerVisionand PatternRecognition, pages 8871–8879, 2024

work page 2024

[20] Haozhe Wang, Alex Su, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning. arXiv preprint arXiv:2505.15966, 2025.

[21] Qian Wang, Zhang Biao, Michael Birsak, and Peter Wonka. Instructedit: Improving automatic masks for diffusion-based image editing with user instructions. http://hdl.handle.net/10754/692507, 2023.

[22] Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1905–1914, 2021.

[23] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report. arXiv preprint arXiv:2508.02324, 2025.

[24] Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, et al. Omnigen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871, 2025.

[25] Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms. arXiv preprint arXiv:2312.14135, 2023.

[26] Yongliang Wu, Zonghui Li, Xinting Hu, Xinyu Ye, Xianfang Zeng, Gang Yu, Wenbo Zhu, Bernt Schiele, Ming-Hsuan Yang, and Xu Yang. Kris-bench: Benchmarking next-level intelligent image editing models. arXiv preprint arXiv:2505.16707, 2025.

[27] Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, and Li Yuan. Imgedit: A unified image editing dataset and benchmark. arXiv preprint arXiv:2505.20275, 2025.

[28] Qifan Yu, Wei Chow, Zhongqi Yue, Kaihang Pan, Yang Wu, Xiaoyang Wan, Juncheng Li, Siliang Tang, Hanwang Zhang, and Yueting Zhuang. Anyedit: Mastering unified high-quality image editing for any idea. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26125–26135, 2025.

[29] Qifan Yu, Wei Chow, Zhongqi Yue, Kaihang Pan, Yang Wu, Xiaoyang Wan, Juncheng Li, Siliang Tang, Hanwang Zhang, and Yueting Zhuang. Anyedit: Mastering unified high-quality image editing for any idea, 2025. URL https://arxiv.org/abs/2411.15738

[30] Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. Magicbrush: A manually annotated dataset for instruction-guided image editing. In Advances in Neural Information Processing Systems, 2023.

[31] Yi-Fan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, Junfei Wu, Feng Li, Kun Wang, Qingsong Wen, Zhang Zhang, et al. Mme-realworld: Could your multimodal llm challenge high-resolution real-world scenarios that are difficult for humans? arXiv preprint arXiv:2408.13257, 2024.

[32] Xiangyu Zhao, Peiyuan Zhang, Kexian Tang, Xiaorong Zhu, Hao Li, Wenhao Chai, Zicheng Zhang, Renqiu Xia, Guangtao Zhai, Junchi Yan, et al. Envisioning beyond the pixels: Benchmarking reasoning-informed visual editing. arXiv preprint arXiv:2504.02826, 2025.

[33] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36:46595–46623, 2023.