
arxiv: 2602.23622 · v2 · pith:FQPFOYWAnew · submitted 2026-02-27 · 💻 cs.CV · cs.AI

DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model

Pith reviewed 2026-05-21 12:00 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords image editing · instruction-based models · small-scale objects · benchmark evaluation · visual consistency · instruction following · IIEMs

The pith

DLEBench reveals that instruction-based image editing models struggle with small objects occupying just 1 to 10 percent of an image area.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DeepLookEditBench as the first dedicated benchmark for testing how well instruction-based image editing models handle small-scale objects. It builds a test set of 1889 samples across seven instruction types, with target objects kept deliberately small and including cases of occlusion and multiple objects. The authors also create a dual-mode evaluation protocol using Tool-driven and Oracle-guided Modes plus refined scoring rubrics for Instruction Following and Visual Consistency. When ten existing models are run on this benchmark, clear performance shortfalls appear in small-object editing.
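
The size constraint can be made concrete with a small check. The sketch below is illustrative only: it assumes the target is described by an axis-aligned bounding box and that the 1% to 10% bounds are inclusive, neither of which is confirmed by the text above (the paper may measure area from a segmentation mask instead). `Box`, `area_fraction`, and `is_small_scale` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical helper)."""
    x_min: int
    y_min: int
    x_max: int
    y_max: int

    def area(self) -> int:
        # Clamp at zero so a degenerate box never yields a negative area.
        return max(0, self.x_max - self.x_min) * max(0, self.y_max - self.y_min)


def area_fraction(box: Box, image_width: int, image_height: int) -> float:
    """Fraction of the image area covered by the box."""
    if image_width <= 0 or image_height <= 0:
        raise ValueError("image dimensions must be positive")
    return box.area() / (image_width * image_height)


def is_small_scale(box: Box, image_width: int, image_height: int,
                   low: float = 0.01, high: float = 0.10) -> bool:
    """True when the target covers between `low` and `high` of the image.

    The inclusive bounds are an assumption made for this sketch.
    """
    return low <= area_fraction(box, image_width, image_height) <= high
```

For example, a 200 x 200 pixel box in a 1000 x 1000 image covers 4% of the area and would pass this filter, while a 500 x 500 box (25%) would not.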

Core claim

DLEBench is a benchmark of 1889 samples designed to evaluate small-scale object editing in instruction-based image editing models, supported by an evaluation protocol that uses Tool-driven and Oracle-guided Modes together with refined rubrics to reduce subjectivity in judging instruction adherence and visual consistency.

What carries the argument

The dual-mode evaluation framework, consisting of Tool-driven and Oracle-guided Modes, which is designed to address the misalignment between large multimodal model judges and human judgments on small-object edits.

If this is right

  • Existing instruction-based image editing models exhibit significant performance gaps when required to edit small objects.
  • Specialized benchmarks focused on small-scale edits are necessary to drive progress in precise local editing.
  • Complex scenarios such as partial occlusion and multi-object editing expose particular weaknesses in current models.
  • Refined rubrics and dual-mode judging improve agreement between automated scores and human assessments on this task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Better small-object editing could support finer detail refinement in both AI-generated images and everyday photo editing workflows.
  • Training regimes that deliberately include small-scale edit examples might help close the gaps shown on DLEBench.
  • The benchmark structure could be reused to test editing of even smaller regions or additional instruction categories.

Load-bearing premise

The proposed evaluation protocol with refined score rubrics and dual-mode framework successfully minimizes subjectivity and aligns with human judgments on the constructed samples.

What would settle it

A side-by-side test in which human raters score the same set of small-object edits that the protocol scores. If the human scores diverge substantially from those produced by the Tool-driven and Oracle-guided Modes, the claim that the protocol aligns with human judgment is falsified.
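
One way to quantify such divergence is a rank correlation between human ratings and protocol scores on the same edits. The following is a minimal, dependency-free sketch of Spearman's rank correlation with average ranks for ties. It is not taken from the paper, and the function names are illustrative.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1  # mean of 1-based positions in the run
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(human_scores, protocol_scores):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(human_scores) != len(protocol_scores) or len(human_scores) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(human_scores), average_ranks(protocol_scores))
```

A value near 1 would indicate that the protocol orders edits the way human raters do; a value near 0 or below would support the falsification scenario described above.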

Original abstract

Significant progress has been made in the field of Instruction-based Image Editing Models (IIEMs). However, while these models demonstrate plausible adherence to instructions and strong reasoning ability on current benchmarks, their ability to edit small objects remains underexplored, despite its importance for precise local editing and refining details in both real and generated images. In this paper, we introduce DeepLookEditBench (DLEBench), the first benchmark dedicated to assessing the abilities of IIEMs in editing small-scale objects. Specifically, we construct a challenging testbed comprising 1889 samples across seven instruction types. In these samples, target objects occupy only 1%-10% of the image area, covering complex scenarios such as partial occlusion and multi-object editing. To ensure robust evaluation on this benchmark, we propose an evaluation protocol with refined score rubrics to minimize subjectivity and ambiguity in two criteria: Instruction Following and Visual Consistency. This protocol also introduces a dual-mode evaluation framework (Tool-driven and Oracle-guided Modes) addressing the misalignment between LMM-as-a-Judge and human judgements on DLEBench. Empirical results on 10 IIEMs reveal significant performance gaps in small-scale object editing, highlighting the need for specialized benchmarks to advance this ability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces DLEBench, the first benchmark dedicated to small-scale object editing in Instruction-based Image Editing Models (IIEMs). It comprises 1889 samples across seven instruction types, with target objects occupying 1-10% of the image area and covering scenarios such as partial occlusion and multi-object editing. The authors propose an evaluation protocol with refined score rubrics for Instruction Following and Visual Consistency, plus a dual-mode framework (Tool-driven and Oracle-guided Modes) intended to reduce subjectivity and correct LMM-as-a-Judge misalignment with human judgments. Empirical evaluation of 10 IIEMs reveals significant performance gaps, motivating the need for specialized benchmarks.

Significance. If the evaluation protocol is shown to be reliable, the benchmark would usefully highlight underexplored limitations in current IIEMs for precise local edits on small objects, potentially guiding targeted improvements in model architectures and training for fine-grained image manipulation tasks.

major comments (1)
  1. The dual-mode evaluation framework is described as successfully minimizing subjectivity and aligning with human judgments on DLEBench. However, the manuscript reports no quantitative validation metrics (e.g., correlation coefficients, Cohen's kappa, or agreement rates) between protocol scores and human annotations on any subset of the 1889 samples. This is load-bearing for the central claim, because the reported performance gaps across the 10 IIEMs cannot be confidently attributed to model limitations without evidence that the rubrics and modes produce scores consistent with human consensus on Instruction Following and Visual Consistency.
minor comments (1)
  1. The abstract and methods description would benefit from a brief table or breakdown showing the number of samples per instruction type and per scenario (e.g., occlusion vs. multi-object) to clarify the benchmark composition.
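
A breakdown of the kind requested in the minor comment could be produced with a few lines of code. The sketch below assumes each sample is a record with `instruction_type` and `scenario` fields. Those field names, and the placeholder values used in the usage note, are hypothetical and do not come from the paper.

```python
from collections import Counter


def composition(samples):
    """Count samples per instruction type and per (type, scenario) pair."""
    per_type = Counter(s["instruction_type"] for s in samples)
    per_cell = Counter((s["instruction_type"], s["scenario"]) for s in samples)
    return per_type, per_cell


def render(per_type, per_cell):
    """Render the counts as a plain-text table, one row per instruction type."""
    scenarios = sorted({scenario for _, scenario in per_cell})
    header = ["instruction_type", *scenarios, "total"]
    rows = [header]
    for itype in sorted(per_type):
        # A Counter returns 0 for pairs that never occurred.
        cells = [str(per_cell[(itype, sc)]) for sc in scenarios]
        rows.append([itype, *cells, str(per_type[itype])])
    widths = [max(len(row[i]) for row in rows) for i in range(len(header))]
    return "\n".join(
        "  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in rows
    )
```

Calling `render(*composition(samples))` on a list of such records yields one row per instruction type, one column per scenario, and a row total, for instance with placeholder types such as `type_a` and scenarios such as `occlusion` and `multi_object`.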

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment point by point below and commit to revisions that strengthen the empirical validation of the proposed evaluation protocol.

Point-by-point responses
  1. Referee: The dual-mode evaluation framework is described as successfully minimizing subjectivity and aligning with human judgments on DLEBench. However, the manuscript reports no quantitative validation metrics (e.g., correlation coefficients, Cohen's kappa, or agreement rates) between protocol scores and human annotations on any subset of the 1889 samples. This is load-bearing for the central claim, because the reported performance gaps across the 10 IIEMs cannot be confidently attributed to model limitations without evidence that the rubrics and modes produce scores consistent with human consensus on Instruction Following and Visual Consistency.

    Authors: We agree that the absence of quantitative validation metrics represents a substantive gap in supporting the central claim. The manuscript describes the design of the refined rubrics for Instruction Following and Visual Consistency together with the dual-mode (Tool-driven and Oracle-guided) framework intended to reduce LMM misalignment, yet it does not report correlation coefficients, Cohen's kappa, or agreement rates against human annotations on any subset of the 1889 samples. This omission limits the strength of the argument that observed performance differences across the ten IIEMs can be confidently attributed to model limitations rather than evaluation artifacts. In the revised manuscript we will add a dedicated human validation study performed on a stratified subset of samples. We will report inter-annotator agreement (Cohen's kappa), correlation between protocol scores and mean human ratings (Pearson/Spearman), and raw agreement percentages separately for each criterion and mode. These results will be presented in a new subsection and table, directly addressing the load-bearing concern raised.

    revision: yes
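
For readers unfamiliar with the agreement statistics named in this exchange, the sketch below computes raw agreement and Cohen's kappa for two raters who assign categorical scores to the same items. It is a generic textbook formulation, not code from the paper, and the function names are illustrative.

```python
from collections import Counter


def raw_agreement(rater_a, rater_b):
    """Share of items on which the two raters gave the same label."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    p_observed = raw_agreement(rater_a, rater_b)
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability that both raters pick the same label
    # if each labels independently according to their own label frequencies.
    p_expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return (p_observed - p_expected) / (1 - p_expected)
```

With ratings `[1, 1, 0, 0]` and `[1, 0, 0, 0]`, raw agreement is 0.75 but kappa is 0.5, because half of that agreement would be expected by chance. This gap is why reporting kappa alongside raw percentages is informative.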

Circularity Check

0 steps flagged

No circularity: direct benchmark construction and external model evaluation

Full rationale

The paper constructs DLEBench (1889 samples across seven instruction types with small objects occupying 1-10% area) and proposes an evaluation protocol with refined rubrics plus dual-mode (Tool-driven/Oracle-guided) framework to address LMM misalignment. No mathematical derivations, fitted parameters, self-referential predictions, or load-bearing self-citations appear in the provided text. Results consist of empirical testing on 10 external IIEMs, with the protocol presented as a methodological contribution rather than a derived quantity that reduces to its own inputs by construction. This is a standard benchmark paper whose central claims rest on sample curation and human-aligned scoring rather than any closed loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark paper. No free parameters, mathematical axioms, or invented entities are introduced; the contribution rests on manual sample curation and rubric design rather than derivations or new theoretical constructs.

pith-pipeline@v0.9.0 · 5771 in / 1117 out tokens · 75369 ms · 2026-05-21T12:00:31.189450+00:00 · methodology

