pith. sign in

arxiv: 2606.00931 · v1 · pith:2U2KN536new · submitted 2026-05-30 · 💻 cs.CV · cs.AI

CV-Arena: An Open Benchmark for Instructional Computer Vision Problem Solving with Human-AI Collaborative Preferences

Pith reviewed 2026-06-28 18:35 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords image editinginstruction followingcomputer vision benchmarkAI evaluationhuman-AI collaborationagentic modelspreference aggregationvisual reasoning
0
0 comments X

The pith

CV-Arena benchmark reveals persistent gaps in AI models for instruction-guided real-image editing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines instructional computer vision problem solving as editing a real input image according to a natural-language instruction while satisfying explicit preservation, geometric, physical, and usability constraints. It introduces CV-Arena, a benchmark of 12,000 high-resolution instruction pairs across 16 task types built through a retrieval-and-curation pipeline. The work also presents Active Elo, a hybrid evaluation protocol that uses a logic-gated VLM judge to handle clear cases and routes only ambiguous ones to human raters. Comprehensive tests of 21 systems show shortfalls in instruction adherence, physical reasoning, structural control, and fine-grained detail preservation. A new agentic model called CV-Agent demonstrates that adding planning and verification loops can address some of these shortfalls.

Core claim

Instructional computer vision problem solving on real images requires satisfying multiple explicit constraints beyond simple appearance changes. CV-Arena supplies 12K traceable instruction pairs spanning 16 task types. Active Elo aggregates reliability-weighted human and AI preferences by letting a multi-dimensional VLM evaluator reject clear failures and route close comparisons to experts. Evaluation of 21 proprietary, open-source, and agentic systems on this benchmark shows consistent shortfalls in instruction adherence, physical reasoning, structural control, and detail preservation. CV-Agent, which interleaves planning, editing, and verification, indicates that closed-loop reasoning impr

What carries the argument

CV-Arena benchmark of 12K real-image instruction pairs evaluated through the Active Elo protocol, which combines a logic-gated VLM judge with selective human review.

If this is right

  • Current models require targeted advances in physical reasoning to handle realistic image transformations.
  • Structural control and fine-grained detail preservation remain unreliable even in the strongest systems tested.
  • Agentic designs that close the loop with planning and verification steps improve instruction following on the benchmark tasks.
  • Hybrid human-AI preference collection via Active Elo enables scalable yet high-fidelity evaluation of editing systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the 16 task types capture core professional needs, similar retrieval pipelines could be applied to create benchmarks for video or 3D editing.
  • The reliability-weighted Elo updates used here could reduce labeling costs when evaluating other generative systems such as text-to-image or code generation.
  • Persistent gaps across model classes suggest that simply scaling existing architectures may not close the distance to professional-grade performance.
  • Integrating explicit verification steps into deployed editing tools could raise reliability without requiring full model retraining.

Load-bearing premise

The CogRetriever pipeline produces instruction pairs that are representative of professional workflows and free of systematic bias in task distribution or image selection.

What would settle it

A single model that scores near ceiling across all 16 task types on CV-Arena and also succeeds on an independent set of real professional editing jobs would falsify the claim of persistent gaps.

Figures

Figures reproduced from arXiv: 2606.00931 by Fangzhou Lin, Haichong Zhang, Kazunori Yamada, Lingyu Xu, Ming-Hsuan Yang, Mingyang Wu, Peiran Li, Qianwen Ge, Shuo Xing, Siyuan Yang, Wenjing Chen, Xiangbo Gao, Zhen Dong, Zhengzhong Tu, Ziming Zhang.

Figure 1
Figure 1. Figure 1: The Overall Pipeline. The framework starts with data curation, where CogRetriever constructs a professional-grade dataset. Then, it is followed by model/agent benchmarking and an Active Elo Evaluation, where CV-Judge generates scores and filters outputs using two-gate constraints, while routing ambiguous and high-quality comparisons to human experts. Final rankings are produced through Active Elo with a re… view at source ↗
Figure 2
Figure 2. Figure 2: Dataset statistics and User Interface. From left to right: (a) the data composition across different sources; (b) shows the image-resolution distributions; and (c) a zoom-in function to check details during human rating. 3.3 Data Acquisition, Filtering, and Human Verification To collect images satisfying the above criteria at scale, we develop CogRetriever, a Text-Initiated Multimodal Search pipeline with … view at source ↗
Figure 3
Figure 3. Figure 3: Average Win Rate Against Top-10 Models with three settings from left to right: Active [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Bootstrap of Elo Estimates (1000 Rounds of Random Sampling) on Top-10 Models with [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative comparison among different editing solutions with reasoning and complex tasks. Our proposed simple baseline CV-Agent consistently produces more faithful, constraint￾satisfying edits, preserving non-target content and structure while better following the instruction than strong single-pass editors. 6 Experiments We benchmark a broad set of instructional image editing systems on CV-Arena, includi… view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative Comparison Among Different Editing Solutions with low level tasks. 22 [PITH_FULL_IMAGE:figures/full_fig_p022_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative Comparison Among Different Editing Solutions with failure cases. I Appendix: CV-Judge VLM Backbone Sensitivity CV-Judge is instantiated with GPT-4o as its backbone VLM, primarily due to the favorable balance between evaluation quality and API cost at the scale of CV-Arena. To verify that our findings are not artifacts of this specific choice, we conducted preliminary cross-VLM comparisons using… view at source ↗
read the original abstract

Instruction-guided image editing is becoming a general interface for visual work, yet existing benchmarks still focus largely on narrow appearance edits and do not fully capture the diversity of real-image tasks in professional workflows. Here, we define instructional computer vision problem solving as a broader formulation of image editing: given a real input image and a natural-language instruction, a system must produce an edited output that realizes the requested transformation while satisfying explicit preservation, geometric, physical, and usability constraints. We introduce CV-Arena, an open benchmark designed to evaluate this capability at professional scales. CV-Arena contains 12K high-resolution real-image instruction pairs spanning 16 instruction-based visual task types, constructed using CogRetriever, a dual-track retrieval-and-curation pipeline that combines targeted web search, agentic query refinement, verification, and traceability. To evaluate models at scale while preserving human fidelity, we propose Active Elo, a human-AI collaborative preference protocol that leverages CV-Judge, a logic-gated, multi-dimensional VLM evaluator, to reject clear failures and resolve high-confidence comparisons; and to route close, high-quality comparisons to expert raters. Mixed human and AI supervision is then aggregated through reliability-weighted Elo updates. Our comprehensive evaluation of 21 systems, including proprietary, open-source, and agentic models, on CV-Arena reveals persistent gaps in instruction adherence, physical reasoning, structural control, and fine-grained detail preservation. We further develop CV-Agent, a lightweight agentic model that combines planning, editing, and verification, and demonstrate that closed-loop reasoning is a promising direction for professional-grade instruction-following visual editing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CV-Arena, a benchmark of 12K high-resolution real-image instruction pairs spanning 16 task types for instructional computer vision problem solving (broader than narrow appearance edits, incorporating preservation, geometric, physical, and usability constraints). It describes the CogRetriever dual-track retrieval-and-curation pipeline (targeted web search, agentic refinement, verification, traceability), proposes Active Elo (human-AI collaborative preference protocol using CV-Judge, a logic-gated multi-dimensional VLM evaluator, to route comparisons and aggregate via reliability-weighted Elo), evaluates 21 proprietary/open-source/agentic systems revealing persistent gaps in instruction adherence, physical reasoning, structural control, and detail preservation, and presents CV-Agent (lightweight agentic model with planning/editing/verification) as a promising direction.

Significance. If the benchmark construction and evaluation protocol are validated as representative and reliable, the work supplies a large-scale, open resource that better matches professional image-editing workflows than prior narrow benchmarks; the human-AI hybrid protocol and explicit failure-mode taxonomy could accelerate progress on instruction-following visual models, while the agentic baseline offers a concrete starting point for closed-loop systems.

major comments (2)
  1. [Abstract / CogRetriever pipeline description] Abstract and § on benchmark construction: the headline claim that evaluation of 21 systems 'reveals persistent gaps in instruction adherence, physical reasoning, structural control, and fine-grained detail preservation' is load-bearing for the paper's contribution, yet no per-task rejection statistics, source-domain entropy, inter-annotator agreement on curation decisions, or comparison against an external professional-editing corpus are supplied to establish that CogRetriever pairs are free of systematic bias or representative of professional workflows.
  2. [Active Elo and CV-Judge description] Evaluation protocol section: the soundness of the reported gaps depends on Active Elo and CV-Judge; the manuscript supplies no quantitative validation of pair quality, inter-rater agreement statistics, or ablation studies isolating the contribution of the logic-gated VLM rejection step versus full human rating.
minor comments (2)
  1. [Abstract] The abstract states '12K high-resolution real-image instruction pairs' but does not include a table or figure summarizing the distribution across the 16 task types; adding this would improve reproducibility and allow readers to assess coverage.
  2. [Method overview] Notation for 'CV-Judge' and 'Active Elo' is introduced without an explicit forward reference to their formal definitions or pseudocode; a dedicated subsection or algorithm box would clarify the multi-dimensional scoring logic.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed report. The two major comments highlight important areas for strengthening the validation of both the benchmark construction and the evaluation protocol. We address each point below and commit to revisions that incorporate the requested quantitative evidence where feasible.

read point-by-point responses
  1. Referee: [Abstract / CogRetriever pipeline description] Abstract and § on benchmark construction: the headline claim that evaluation of 21 systems 'reveals persistent gaps in instruction adherence, physical reasoning, structural control, and fine-grained detail preservation' is load-bearing for the paper's contribution, yet no per-task rejection statistics, source-domain entropy, inter-annotator agreement on curation decisions, or comparison against an external professional-editing corpus are supplied to establish that CogRetriever pairs are free of systematic bias or representative of professional workflows.

    Authors: We agree that these quantitative details are necessary to fully substantiate the representativeness claim. In the revised manuscript we will add: (i) per-task rejection statistics from the CogRetriever verification stage, (ii) source-domain entropy computed over the 12K pairs, and (iii) inter-annotator agreement (Cohen’s kappa) on the agentic refinement and human verification decisions. For the requested comparison against an external professional-editing corpus, no public corpus matching the 16-task breadth and real-image constraint set currently exists; we will instead provide a qualitative mapping of our task taxonomy to documented professional workflows (e.g., from Adobe, Figma, and photography post-production literature) and report how the 16 categories were derived from those sources. revision: yes

  2. Referee: [Active Elo and CV-Judge description] Evaluation protocol section: the soundness of the reported gaps depends on Active Elo and CV-Judge; the manuscript supplies no quantitative validation of pair quality, inter-rater agreement statistics, or ablation studies isolating the contribution of the logic-gated VLM rejection step versus full human rating.

    Authors: We concur that empirical validation of the hybrid protocol is essential. The revised version will include: (i) inter-rater agreement (Fleiss’ kappa) between expert human raters and between humans and CV-Judge on a held-out subset of 500 pairs, (ii) pair-quality metrics (e.g., consistency of preference outcomes across repeated judgments), and (iii) an ablation comparing full-human rating against the logic-gated CV-Judge routing in terms of both agreement with final Elo rankings and annotation cost. These additions will directly quantify the reliability of the reported model gaps. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark construction and empirical evaluation are self-contained

full rationale

The paper introduces CV-Arena as a new benchmark via the CogRetriever pipeline and evaluates 21 systems using Active Elo with CV-Judge; no mathematical derivations, parameter fittings presented as predictions, or self-citation chains appear in the central claims. The construction pipeline and evaluation protocol are described directly without reducing any result to its own inputs by definition. The reader's assessment of score 1.0 aligns with the absence of any load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 3 invented entities

The central claims rest on the assumption that the new curation and evaluation constructs are valid without external validation data shown in the abstract; no free parameters or mathematical axioms are invoked.

invented entities (3)
  • CV-Judge no independent evidence
    purpose: logic-gated multi-dimensional VLM evaluator for routing comparisons
    Introduced as a component of Active Elo; no independent evidence supplied.
  • Active Elo no independent evidence
    purpose: human-AI collaborative preference aggregation protocol
    New evaluation method described in abstract; no prior validation cited.
  • CV-Agent no independent evidence
    purpose: lightweight agentic model combining planning, editing, and verification
    Proposed system whose performance is reported but not independently verified.

pith-pipeline@v0.9.1-grok · 5878 in / 1264 out tokens · 19396 ms · 2026-06-28T18:35:00.175761+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

90 extracted references · 30 canonical work pages · 14 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Rank analysis of incomplete block designs: I

    Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons.Biometrika, 39(3/4):324–345, 1952

  3. [3]

    Tim Brooks, Aleksander Holynski, and Alexei A. Efros. Instructpix2pix: Learning to follow image editing instructions.2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18392–18402, 2022

  4. [4]

    Instruction-based image manipulation by watching how things move

    Mingdeng Cao, Xuaner Zhang, Yinqiang Zheng, and Zhihao Xia. Instruction-based image manipulation by watching how things move. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 2704–2713, 2025

  5. [6]

    Bytemorph: Benchmarking instruction-guided image editing with non-rigid motions.arXiv preprint arXiv:2506.03107, 2025

    Di Chang, Mingdeng Cao, Yichun Shi, Bo Liu, Shengqu Cai, Shijie Zhou, Weilin Huang, Gordon Wetzstein, Mohammad Soleymani, and Peng Wang. Bytemorph: Benchmarking instruction-guided image editing with non-rigid motions.arXiv preprint arXiv:2506.03107, 2025

  6. [7]

    Chen, Misha Sra, and Pradeep Sen

    Sherry X. Chen, Misha Sra, and Pradeep Sen. Instruct-clip: Improving instruction-guided image editing with automated data refinement using contrastive learning, 2025

  7. [8]

    Hierarchical integration diffusion model for realistic image deblurring.Advances in neural information processing systems, 36:29114–29125, 2023

    Zheng Chen, Yulun Zhang, Ding Liu, Jinjin Gu, Linghe Kong, Xin Yuan, et al. Hierarchical integration diffusion model for realistic image deblurring.Advances in neural information processing systems, 36:29114–29125, 2023

  8. [9]

    Opengpt-4o-image: A comprehensive dataset for advanced image generation and editing.arXiv preprint arXiv:2509.24900, 2025

    Zhihong Chen, Xuehai Bai, Yang Shi, Chaoyou Fu, Huanyu Zhang, Haotian Wang, Xiaoyan Sun, Zhang Zhang, Liang Wang, Yuanxing Zhang, et al. Opengpt-4o-image: A comprehensive dataset for advanced image generation and editing.arXiv preprint arXiv:2509.24900, 2025

  9. [10]

    Exploring the naturalness of ai-generated images.arXiv preprint arXiv:2312.05476, 2023

    Zijian Chen, Wei Sun, Haoning Wu, Zicheng Zhang, Jun Jia, Zhongpeng Ji, Fengyu Sun, Shangling Jui, Xiongkuo Min, Guangtao Zhai, et al. Exploring the naturalness of ai-generated images.arXiv preprint arXiv:2312.05476, 2023

  10. [11]

    Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

    Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E Gonzalez, et al. Chatbot arena: An open platform for evaluating llms by human preference.arXiv preprint arXiv:2403.04132, 2024

  11. [12]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

  12. [13]

    Image harmonization dataset iharmony4: Hcoco, hadobe5k, hflickr, and hday2night.arXiv preprint arXiv:1908.10526, 2019

    Wenyan Cong, Jianfu Zhang, Li Niu, Liu Liu, Zhixin Ling, Weiyuan Li, and Liqing Zhang. Image harmonization dataset iharmony4: Hcoco, hadobe5k, hflickr, and hday2night.arXiv preprint arXiv:1908.10526, 2019. 10

  13. [14]

    Introducing gemini 2.5 flash image, our state-of-the-art image model, 2025

    Google DeepMind. Introducing gemini 2.5 flash image, our state-of-the-art image model, 2025

  14. [15]

    Introducing nano banana pro, 2025

    Google DeepMind. Introducing nano banana pro, 2025

  15. [16]

    Imagenet: A large- scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large- scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009

  16. [17]

    Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

    Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators.arXiv preprint arXiv:2404.04475, 2024

  17. [18]

    Alpacafarm: A simulation framework for methods that learn from human feedback.Advances in Neural Information Processing Systems, 36:30039–30069, 2023

    Yann Dubois, Chen Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy S Liang, and Tatsunori B Hashimoto. Alpacafarm: A simulation framework for methods that learn from human feedback.Advances in Neural Information Processing Systems, 36:30039–30069, 2023

  18. [19]

    Image classification based on cnn: a survey.Journal of Cybersecurity and Information Management, 6(1):18–50, 2021

    Ahmed A Elngar, Mohamed Arafa, Amar Fathy, Basma Moustafa, Omar Mahmoud, Mohamed Shaban, and Nehal Fawzy. Image classification based on cnn: a survey.Journal of Cybersecurity and Information Management, 6(1):18–50, 2021

  19. [20]

    Recognition-synergistic scene text editing

    Zhengyao Fang, Pengyuan Lyu, Jingjing Wu, Chengquan Zhang, Jun Yu, Guangming Lu, and Wenjie Pei. Recognition-synergistic scene text editing. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 13104–13113, 2025

  20. [21]

    prentice hall professional technical reference, 2002

    David A Forsyth and Jean Ponce.Computer vision: a modern approach. prentice hall professional technical reference, 2002

  21. [22]

    Seed-data-edit technical report: A hybrid dataset for instructional image editing.arXiv preprint arXiv:2405.04007, 2024

    Yuying Ge, Sijie Zhao, Chen Li, Yixiao Ge, and Ying Shan. Seed-data-edit technical report: A hybrid dataset for instructional image editing.arXiv preprint arXiv:2405.04007, 2024

  22. [23]

    Gemini 2.5 pro model card, 2025

    Google. Gemini 2.5 pro model card, 2025. Accessed: 2026-01-11

  23. [24]

    Multi- reward as condition for instruction-based image editing

    Xin Gu, Ming Li, Libo Zhang, Fan Chen, Longyin Wen, Tiejian Luo, and Sijie Zhu. Multi- reward as condition for instruction-based image editing. InThe Thirteenth International Conference on Learning Representations, 2025

  24. [25]

    Improving visual and downstream performance of low-light enhancer with vision foun- dation models collaboration

    Yuxuan Gu, Haoxuan Wang, Pengyang Ling, Zhixiang Wei, Huaian Chen, Yi Jin, and Enhong Chen. Improving visual and downstream performance of low-light enhancer with vision foun- dation models collaboration. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 16071–16080, 2025

  25. [26]

    Large Language Model based Multi-Agents: A Survey of Progress and Challenges

    Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V Chawla, Olaf Wiest, and Xiangliang Zhang. Large language model based multi-agents: A survey of progress and challenges.arXiv preprint arXiv:2402.01680, 2024

  26. [27]

    Clip and complementary methods.Nature Reviews Methods Primers, 1(1):20, 2021

    Markus Hafner, Maria Katsantoni, Tino Köster, James Marks, Joyita Mukherjee, Dorothee Staiger, Jernej Ule, and Mihaela Zavolan. Clip and complementary methods.Nature Reviews Methods Primers, 1(1):20, 2021

  27. [28]

    Clipscore: A reference-free evaluation metric for image captioning

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. InProceedings of the 2021 conference on empirical methods in natural language processing, pages 7514–7528, 2021

  28. [29]

    Hq-edit: A high-quality dataset for instruction-based image editing.arXiv preprint arXiv:2404.09990, 2024

    Mude Hui, Siwei Yang, Bingchen Zhao, Yichun Shi, Heng Wang, Peng Wang, Yuyin Zhou, and Cihang Xie. Hq-edit: A high-quality dataset for instruction-based image editing.arXiv preprint arXiv:2404.09990, 2024

  29. [30]

    Grounding degradations in natural language for all-in-one video restoration

    Muhammad Kamran Janjua, Amirhosein Ghasemabadi, Kunlin Zhang, Mohammad Salameh, Chao Gao, and Di Niu. Grounding degradations in natural language for all-in-one video restoration. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5734–5743, 2026. 11

  30. [31]

    Compbench: Benchmarking complex instruction-guided image editing.arXiv preprint arXiv:2505.12200, 2025

    Bohan Jia, Wenxuan Huang, Yuntian Tang, Junbo Qiao, Jincheng Liao, Shaosheng Cao, Fei Zhao, Zhaopeng Feng, Zhouhong Gu, Zhenfei Yin, et al. Compbench: Benchmarking complex instruction-guided image editing.arXiv preprint arXiv:2505.12200, 2025

  31. [32]

    Genai arena: An open evaluation platform for generative models.Advances in Neural Information Processing Systems, 37:79889–79908, 2024

    Dongfu Jiang, Max Ku, Tianle Li, Yuansheng Ni, Shizhuo Sun, Rongqi Fan, and Wenhu Chen. Genai arena: An open evaluation platform for generative models.Advances in Neural Information Processing Systems, 37:79889–79908, 2024

  32. [33]

    A new measure of rank correlation.Biometrika, 30(1-2):81–93, 1938

    Maurice G Kendall. A new measure of rank correlation.Biometrika, 30(1-2):81–93, 1938

  33. [34]

    Orida: Object-centric real-world image composition dataset

    Jinwoo Kim, Sangmin Han, Jinho Jeong, Jiwoo Choi, Dongyeoung Kim, and Seon Joo Kim. Orida: Object-centric real-world image composition dataset. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 3051–3060, 2025

  34. [35]

    Peak signal-to-noise ratio revisited: Is simple beautiful? In 2012 Fourth International Workshop on Quality of Multimedia Experience, pages 37–38, 2012

    Jari Korhonen and Junyong You. Peak signal-to-noise ratio revisited: Is simple beautiful? In 2012 Fourth International Workshop on Quality of Multimedia Experience, pages 37–38, 2012

  35. [36]

    Imagenhub: Standardizing the evaluation of conditional image generation models.arXiv preprint arXiv:2310.01596, 2023

    Max Ku, Tianle Li, Kai Zhang, Yujie Lu, Xingyu Fu, Wenwen Zhuang, and Wenhu Chen. Imagenhub: Standardizing the evaluation of conditional image generation models.arXiv preprint arXiv:2310.01596, 2023

  36. [37]

    FLUX.2: Frontier Visual Intelligence

    Black Forest Labs. FLUX.2: Frontier Visual Intelligence. https://bfl.ai/blog/flux-2, 2025

  37. [38]

    The mnist database of handwritten digits.http://yann

    Yann LeCun. The mnist database of handwritten digits.http://yann. lecun. com/exdb/mnist/, 1998

  38. [39]

    Superedit: Rectifying and facilitating supervision for instruction-based image editing

    Ming Li, Xin Gu, Fan Chen, Xiaoying Xing, Longyin Wen, Chen Chen, and Sijie Zhu. Superedit: Rectifying and facilitating supervision for instruction-based image editing. 2025

  39. [40]

    Towards benchmarking and assessing visual naturalness of physical world adversarial attacks

    Simin Li, Shuning Zhang, Gujun Chen, Dong Wang, Pu Feng, Jiakai Wang, Aishan Liu, Xin Yi, and Xianglong Liu. Towards benchmarking and assessing visual naturalness of physical world adversarial attacks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12324–12333, 2023

  40. [41]

    UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

    Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, et al. Uniworld: High-resolution semantic encoders for unified visual understanding and generation.arXiv preprint arXiv:2506.03147, 2025

  41. [42]

    Jarvisevo: Towards a self-evolving photo editing agent with synergistic editor-evaluator optimization.arXiv preprint arXiv:2511.23002, 2025

    Yunlong Lin, Linqing Wang, Kunjie Lin, Zixu Lin, Kaixiong Gong, Wenbo Li, Bin Lin, Zhenxi Li, Shiyi Zhang, Yuyang Peng, et al. Jarvisevo: Towards a self-evolving photo editing agent with synergistic editor-evaluator optimization.arXiv preprint arXiv:2511.23002, 2025

  42. [43]

    Referring image editing: Object-level image editing via referring expressions

    Chang Liu, Xiangtai Li, and Henghui Ding. Referring image editing: Object-level image editing via referring expressions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13128–13138, 2024

  43. [44]

    Step1X-Edit: A Practical Framework for General Image Editing

    Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, Guopeng Li, Yuang Peng, Quan Sun, Jingwei Wu, Yan Cai, Zheng Ge, Ranchen Ming, Lei Xia, Xianfang Zeng, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Gang Yu, and Daxin Jiang. Step1x-edit: A practical framework for general image editing.arXiv pre...

  44. [45]

    DeepSeek-VL: Towards Real-World Vision-Language Understanding

    Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. Deepseek-vl: towards real-world vision-language under- standing.arXiv preprint arXiv:2403.05525, 2024

  45. [46]

    Visual-instructed degradation diffusion for all-in-one image restoration

    Wenyang Luo, Haina Qin, Zewen Chen, Libin Wang, Dandan Zheng, Yuming Li, Yufan Liu, Bing Li, and Weiming Hu. Visual-instructed degradation diffusion for all-in-one image restoration. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 12764–12777, 2025

  46. [47]

    Visual autoregressive modeling for instruction-guided image editing.arXiv preprint, 2025

    Qingyang Mao, Qi Cai, Yehao Li, Yingwei Pan, Mingyue Cheng, Ting Yao, Qi Liu, and Tao Mei. Visual autoregressive modeling for instruction-guided image editing.arXiv preprint, 2025. 12

  47. [48]

    Introducing manus 1.6: Max performance, mobile dev, and design view, 2025

    Manus Meta. Introducing manus 1.6: Max performance, mobile dev, and design view, 2025

  48. [49]

    Image segmentation using deep learning: A survey.IEEE transactions on pattern analysis and machine intelligence, 44(7):3523–3542, 2021

    Shervin Minaee, Yuri Boykov, Fatih Porikli, Antonio Plaza, Nasser Kehtarnavaz, and Demetri Terzopoulos. Image segmentation using deep learning: A survey.IEEE transactions on pattern analysis and machine intelligence, 44(7):3523–3542, 2021

  49. [50]

    A comprehensive overview of large language models.ACM Transactions on Intelligent Systems and Technology, 16(5):1–72, 2025

    Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, and Ajmal Mian. A comprehensive overview of large language models.ACM Transactions on Intelligent Systems and Technology, 16(5):1–72, 2025

  50. [51]

    Gpt image 1, 2025

    OpenAI. Gpt image 1, 2025

  51. [52]

    Introducing chatgpt agent: bridging research and action, 2025

    OpenAI. Introducing chatgpt agent: bridging research and action, 2025. Accessed: 2026-01-26

  52. [53]

    The new chatgpt images is here, 2025

    OpenAI. The new chatgpt images is here, 2025

  53. [54]

    Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

  54. [55]

    The claude 3 model family: Opus, sonnet, haiku

    Anthropic PBC. The claude 3 model family: Opus, sonnet, haiku

  55. [56]

    Dreambench++: A human-aligned benchmark for personalized image generation.arXiv preprint arXiv:2406.16855, 2024

    Yuang Peng, Yuxin Cui, Haomiao Tang, Zekun Qi, Runpei Dong, Jing Bai, Chunrui Han, Zheng Ge, Xiangyu Zhang, and Shu-Tao Xia. Dreambench++: A human-aligned benchmark for personalized image generation.arXiv preprint arXiv:2406.16855, 2024

  56. [57]

    Pico-banana-400k: A large-scale dataset for text-guided image editing.arXiv preprint arXiv:2510.19808, 2025

    Yusu Qian, Eli Bocek-Rivele, Liangchen Song, Jialing Tong, Yinfei Yang, Jiasen Lu, Wenze Hu, and Zhe Gan. Pico-banana-400k: A large-scale dataset for text-guided image editing.arXiv preprint arXiv:2510.19808, 2025

  57. [58]

    Ex- ploring stroke-level modifications for scene text editing

    Yadong Qu, Qingfeng Tan, Hongtao Xie, Jianjun Xu, Yuxin Wang, and Yongdong Zhang. Ex- ploring stroke-level modifications for scene text editing. InProceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 2119–2127, 2023

  58. [59]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

  59. [60]

    Seedream 4.0: Toward Next-generation Multimodal Image Generation

    Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next-generation multimodal image generation.arXiv preprint arXiv:2509.20427, 2025

  60. [61]

    Insert anything: Image insertion via in-context editing in dit

    Wensong Song, Hong Jiang, Zongxin Yang, Zheqiao Cheng, Ruijie Quan, and Yi Yang. Insert anything: Image insertion via in-context editing in dit. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 9097–9105, 2026

  61. [62]

    Springer Nature, 2022

    Richard Szeliski.Computer vision: algorithms and applications. Springer Nature, 2022

  62. [63]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

  63. [64]

    Add- it: Training-free object insertion in images with pretrained diffusion models.arXiv preprint arXiv:2411.07232, 2024

    Yoad Tewel, Rinon Gal, Dvir Samuel, Yuval Atzmon, Lior Wolf, and Gal Chechik. Add- it: Training-free object insertion in images with pretrained diffusion models.arXiv preprint arXiv:2411.07232, 2024

  64. [65]

    What makes an image realistic?arXiv preprint arXiv:2403.04493, 2024

    Lucas Theis. What makes an image realistic?arXiv preprint arXiv:2403.04493, 2024

  65. [66]

    Deep learning for computer vision: A brief review.Computational intelligence and neuroscience, 2018(1):7068349, 2018

    Athanasios V oulodimos, Nikolaos Doulamis, Anastasios Doulamis, and Eftychios Protopa- padakis. Deep learning for computer vision: A brief review.Computational intelligence and neuroscience, 2018(1):7068349, 2018. 13

  66. [67]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

  67. [68]

    Adapting text-to-image generation with feature difference instruction for generic image restoration

    Chao Wang, Hehe Fan, Huichen Yang, Sarvnaz Karimi, Lina Yao, and Yi Yang. Adapting text-to-image generation with feature difference instruction for generic image restoration. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23539–23550, 2025

  68. [69]

    Complexbench- edit: Benchmarking complex instruction-driven image editing via compositional dependencies

    Chenglin Wang, Yucheng Zhou, Qianning Wang, Zhe Wang, and Kai Zhang. Complexbench- edit: Benchmarking complex instruction-driven image editing via compositional dependencies. InProceedings of the 33rd ACM International Conference on Multimedia, pages 13391–13397, 2025

  69. [70]

    Vision-zero: Scalable vlm self-improvement via strategic gamified self-play

    Qinsi Wang, Bo Liu, Tianyi Zhou, Jing Shi, Yueqian Lin, Yiran Chen, Hai Helen Li, Kun Wan, and Wentian Zhao. Vision-zero: Scalable vlm self-improvement via strategic gamified self-play. arXiv preprint arXiv:2509.25541, 2025

  70. [71]

    Rl-vlm-f: Reinforcement learning from vision language foundation model feedback

    Yufei Wang, Zhanyi Sun, Jesse Zhang, Zhou Xian, Erdem Biyik, David Held, and Zackory Erickson. Rl-vlm-f: Reinforcement learning from vision language foundation model feedback. arXiv preprint arXiv:2402.03681, 2024

  71. [72]

    Bovik, H.R

    Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from error visibility to structural similarity.IEEE Transactions on Image Processing, 13(4):600–612, 2004

  72. [73]

    Objectdrop: Bootstrapping counterfactuals for photorealistic object removal and insertion

    Daniel Winter, Matan Cohen, Shlomi Fruchter, Yael Pritch, Alex Rav-Acha, and Yedid Hoshen. Objectdrop: Bootstrapping counterfactuals for photorealistic object removal and insertion. In European Conference on Computer Vision, pages 112–129. Springer, 2024

  73. [74]

    Qwen-image technical report, 2025

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun...

  74. [75]

    Editre- ward: A human-aligned reward model for instruction-guided image editing.arXiv preprint arXiv:2509.26346, 2025

    Keming Wu, Sicong Jiang, Max Ku, Ping Nie, Minghao Liu, and Wenhu Chen. Editre- ward: A human-aligned reward model for instruction-guided image editing.arXiv preprint arXiv:2509.26346, 2025

  75. [76]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe eleventh international conference on learning representations, 2022

  76. [77]

    ImgEdit: A Unified Image Editing Dataset and Benchmark

    Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, and Li Yuan. Imgedit: A unified image editing dataset and benchmark.arXiv preprint arXiv:2505.20275, 2025

  77. [78]

    Anyedit: Mastering unified high-quality image editing for any idea

    Qifan Yu, Wei Chow, Zhongqi Yue, Kaihang Pan, Yang Wu, Xiaoyang Wan, Juncheng Li, Siliang Tang, Hanwang Zhang, and Yueting Zhuang. Anyedit: Mastering unified high-quality image editing for any idea. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 26125–26135, 2025

  78. [79]

    DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection

    Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M Ni, and Heung- Yeung Shum. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605, 2022. 14

  79. [80]

    Vision-language models for vision tasks: A survey.IEEE transactions on pattern analysis and machine intelligence, 46(8):5625– 5644, 2024

    Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey.IEEE transactions on pattern analysis and machine intelligence, 46(8):5625– 5644, 2024

  80. [81]

    Magicbrush: A manually annotated dataset for instruction-guided image editing.Advances in Neural Information Processing Systems, 36:31428–31449, 2023

    Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. Magicbrush: A manually annotated dataset for instruction-guided image editing.Advances in Neural Information Processing Systems, 36:31428–31449, 2023

Showing first 80 references.