CV-Arena: An Open Benchmark for Instructional Computer Vision Problem Solving with Human-AI Collaborative Preferences

Fangzhou Lin; Haichong Zhang; Kazunori Yamada; Lingyu Xu; Ming-Hsuan Yang; Mingyang Wu; Peiran Li; Qianwen Ge; Shuo Xing; Siyuan Yang

arxiv: 2606.00931 · v1 · pith:2U2KN536new · submitted 2026-05-30 · 💻 cs.CV · cs.AI

Fangzhou Lin, Peiran Li, Lingyu Xu, Wenjing Chen, Qianwen Ge, Shuo Xing, Mingyang Wu, Xiangbo Gao, Siyuan Yang, Kazunori Yamada, Ziming Zhang, Haichong Zhang, Zhen Dong, Ming-Hsuan Yang, Zhengzhong Tu

Pith reviewed 2026-06-28 18:35 UTC · model grok-4.3

keywords image editing · instruction following · computer vision benchmark · AI evaluation · human-AI collaboration · agentic models · preference aggregation · visual reasoning

The pith

CV-Arena benchmark reveals persistent gaps in AI models for instruction-guided real-image editing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines instructional computer vision problem solving as editing a real input image according to a natural-language instruction while satisfying explicit preservation, geometric, physical, and usability constraints. It introduces CV-Arena, a benchmark of 12,000 high-resolution instruction pairs across 16 task types built through a retrieval-and-curation pipeline. The work also presents Active Elo, a hybrid evaluation protocol that uses a logic-gated VLM judge to handle clear cases and routes only ambiguous ones to human raters. Comprehensive tests of 21 systems show shortfalls in instruction adherence, physical reasoning, structural control, and fine-grained detail preservation. A new agentic model called CV-Agent demonstrates that adding planning and verification loops can address some of these shortfalls.

Core claim

Instructional computer vision problem solving on real images requires satisfying multiple explicit constraints beyond simple appearance changes. CV-Arena supplies 12K traceable instruction pairs spanning 16 task types. Active Elo aggregates reliability-weighted human and AI preferences by letting a multi-dimensional VLM evaluator reject clear failures and route close comparisons to experts. Evaluation of 21 proprietary, open-source, and agentic systems on this benchmark shows consistent shortfalls in instruction adherence, physical reasoning, structural control, and detail preservation. CV-Agent, which interleaves planning, editing, and verification, indicates that closed-loop reasoning impr

What carries the argument

CV-Arena benchmark of 12K real-image instruction pairs evaluated through the Active Elo protocol, which combines a logic-gated VLM judge with selective human review.
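
The text reproduced here does not give the exact update rule behind Active Elo. As a minimal sketch only, the snippet below assumes the standard Elo logistic expectation and scales the step size by a reliability weight in [0, 1]. The names `elo_update`, `reliability`, and `k`, and the example weights, are illustrative assumptions and not taken from the paper.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of A against B (base 10, scale 400)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a, r_b, outcome, reliability=1.0, k=32.0):
    """Return the updated pair (r_a, r_b) after one comparison.

    outcome: 1.0 if A is preferred, 0.0 if B is preferred, 0.5 for a tie.
    reliability: ASSUMED weight in [0, 1] for the source of the vote
                 (for example, lower for a VLM judge than for an expert).
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    delta = k * reliability * (outcome - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


# Toy run: one fully trusted vote and one half-weighted vote (made-up values).
ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = [
    ("model_a", "model_b", 1.0, 1.0),  # hypothetical expert vote
    ("model_a", "model_b", 1.0, 0.5),  # hypothetical judge vote, half weight
]
for a, b, outcome, weight in votes:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], outcome, weight)
```

Scaling the step by the weight keeps the update zero-sum, so a less trusted vote moves both ratings by a proportionally smaller amount.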

If this is right

Current models require targeted advances in physical reasoning to handle realistic image transformations.
Structural control and fine-grained detail preservation remain unreliable even in the strongest systems tested.
Agentic designs that close the loop with planning and verification steps improve instruction following on the benchmark tasks.
Hybrid human-AI preference collection via Active Elo enables scalable yet high-fidelity evaluation of editing systems.
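
The closed-loop idea in the third point can be sketched as a plan, edit, verify loop. This is a hypothetical illustration, not CV-Agent itself: the callables `plan`, `edit`, and `verify`, the retry budget `max_rounds`, and the feedback format are all assumptions introduced here.

```python
def closed_loop_edit(image, instruction, plan, edit, verify, max_rounds=3):
    """Apply an instruction step by step, keeping only verified edits.

    plan(instruction) -> list of step strings
    edit(image, step) -> candidate image
    verify(original, candidate, step) -> (ok: bool, feedback: str)
    All three callables are hypothetical stand-ins for real components.
    """
    current = image
    for step in plan(instruction):
        for _ in range(max_rounds):
            candidate = edit(current, step)
            ok, feedback = verify(image, candidate, step)
            if ok:
                current = candidate  # accept the verified edit
                break
            # Fold the verifier's feedback into the next attempt.
            step = f"{step} (fix: {feedback})"
        # If no attempt passes, the step is skipped and `current` is unchanged.
    return current


# Stub components: an "image" is just the list of edits applied so far.
def split_plan(instruction):
    return instruction.split(" then ")

def append_edit(img, step):
    return img + [step]

def accept_all(original, candidate, step):
    return True, ""

def reject_all(original, candidate, step):
    return False, "bad"

result = closed_loop_edit([], "crop then sharpen", split_plan, append_edit, accept_all)
```

A single-pass editor corresponds to calling `edit` once with no verifier; the loop adds the option to retry or skip a step when verification fails.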

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the 16 task types capture core professional needs, similar retrieval pipelines could be applied to create benchmarks for video or 3D editing.
The reliability-weighted Elo updates used here could reduce labeling costs when evaluating other generative systems such as text-to-image or code generation.
Persistent gaps across model classes suggest that simply scaling existing architectures may not close the distance to professional-grade performance.
Integrating explicit verification steps into deployed editing tools could raise reliability without requiring full model retraining.

Load-bearing premise

The CogRetriever pipeline produces instruction pairs that are representative of professional workflows and free of systematic bias in task distribution or image selection.

What would settle it

A single model that scores near ceiling across all 16 task types on CV-Arena and also succeeds on an independent set of real professional editing jobs would falsify the claim of persistent gaps.

Figures

Figures reproduced from arXiv: 2606.00931 by Fangzhou Lin, Haichong Zhang, Kazunori Yamada, Lingyu Xu, Ming-Hsuan Yang, Mingyang Wu, Peiran Li, Qianwen Ge, Shuo Xing, Siyuan Yang, Wenjing Chen, Xiangbo Gao, Zhen Dong, Zhengzhong Tu, Ziming Zhang.

**Figure 1.** The Overall Pipeline. The framework starts with data curation, where CogRetriever constructs a professional-grade dataset. Then, it is followed by model/agent benchmarking and an Active Elo Evaluation, where CV-Judge generates scores and filters outputs using two-gate constraints, while routing ambiguous and high-quality comparisons to human experts. Final rankings are produced through Active Elo with a re…

**Figure 2.** Dataset statistics and User Interface. From left to right: (a) the data composition across different sources; (b) shows the image-resolution distributions; and (c) a zoom-in function to check details during human rating. 3.3 Data Acquisition, Filtering, and Human Verification To collect images satisfying the above criteria at scale, we develop CogRetriever, a Text-Initiated Multimodal Search pipeline with …

**Figure 3.** Average Win Rate Against Top-10 Models with three settings from left to right: Active …

**Figure 4.** Bootstrap of Elo Estimates (1000 Rounds of Random Sampling) on Top-10 Models with …

**Figure 5.** Qualitative comparison among different editing solutions with reasoning and complex tasks. Our proposed simple baseline CV-Agent consistently produces more faithful, constraint-satisfying edits, preserving non-target content and structure while better following the instruction than strong single-pass editors. 6 Experiments We benchmark a broad set of instructional image editing systems on CV-Arena, includi…

**Figure 6.** Qualitative Comparison Among Different Editing Solutions with low level tasks.

**Figure 7.** Qualitative Comparison Among Different Editing Solutions with failure cases. I Appendix: CV-Judge VLM Backbone Sensitivity CV-Judge is instantiated with GPT-4o as its backbone VLM, primarily due to the favorable balance between evaluation quality and API cost at the scale of CV-Arena. To verify that our findings are not artifacts of this specific choice, we conducted preliminary cross-VLM comparisons using…
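
Figure 4 refers to bootstrapped Elo estimates over 1000 rounds of random sampling. The caption is truncated and the exact procedure is not shown here, so the following is a generic sketch under stated assumptions: comparisons are resampled with replacement, a plain sequential Elo is recomputed on each resample, and a percentile interval is read off. The function names, the K-factor, and the toy match list are illustrative, not the paper's.

```python
import random


def run_elo(matches, k=32.0, base=1000.0):
    """Sequential Elo over (model_a, model_b, outcome) triples; outcome is A's score."""
    ratings = {}
    for a, b, outcome in matches:
        ra = ratings.setdefault(a, base)
        rb = ratings.setdefault(b, base)
        expected = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        delta = k * (outcome - expected)
        ratings[a] = ra + delta
        ratings[b] = rb - delta
    return ratings


def bootstrap_elo(matches, rounds=1000, seed=0):
    """Resample the match list with replacement and collect ratings per model."""
    rng = random.Random(seed)
    samples = {}
    for _ in range(rounds):
        resampled = rng.choices(matches, k=len(matches))
        for model, rating in run_elo(resampled).items():
            samples.setdefault(model, []).append(rating)
    return samples


def percentile_interval(values, lo=0.025, hi=0.975):
    """Simple percentile interval from sorted bootstrap samples."""
    ordered = sorted(values)
    n = len(ordered)
    return ordered[int(lo * (n - 1))], ordered[int(hi * (n - 1))]


# Toy data (made up): model_a preferred in 8 of 10 comparisons.
matches = [("model_a", "model_b", 1.0)] * 8 + [("model_a", "model_b", 0.0)] * 2
samples = bootstrap_elo(matches, rounds=1000)
low, high = percentile_interval(samples["model_a"])
```

Overlapping intervals between two models would suggest their ranking is not stable under resampling, which is the kind of check a bootstrap plot is typically used for.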

Original abstract

Instruction-guided image editing is becoming a general interface for visual work, yet existing benchmarks still focus largely on narrow appearance edits and do not fully capture the diversity of real-image tasks in professional workflows. Here, we define instructional computer vision problem solving as a broader formulation of image editing: given a real input image and a natural-language instruction, a system must produce an edited output that realizes the requested transformation while satisfying explicit preservation, geometric, physical, and usability constraints. We introduce CV-Arena, an open benchmark designed to evaluate this capability at professional scales. CV-Arena contains 12K high-resolution real-image instruction pairs spanning 16 instruction-based visual task types, constructed using CogRetriever, a dual-track retrieval-and-curation pipeline that combines targeted web search, agentic query refinement, verification, and traceability. To evaluate models at scale while preserving human fidelity, we propose Active Elo, a human-AI collaborative preference protocol that leverages CV-Judge, a logic-gated, multi-dimensional VLM evaluator, to reject clear failures and resolve high-confidence comparisons; and to route close, high-quality comparisons to expert raters. Mixed human and AI supervision is then aggregated through reliability-weighted Elo updates. Our comprehensive evaluation of 21 systems, including proprietary, open-source, and agentic models, on CV-Arena reveals persistent gaps in instruction adherence, physical reasoning, structural control, and fine-grained detail preservation. We further develop CV-Agent, a lightweight agentic model that combines planning, editing, and verification, and demonstrate that closed-loop reasoning is a promising direction for professional-grade instruction-following visual editing.
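
The routing behaviour described in the abstract (reject clear failures, resolve high-confidence comparisons automatically, send close high-quality comparisons to experts) can be sketched as a two-gate decision. This is an assumption-laden illustration: the single aggregate score per output, the thresholds `min_quality` and `margin`, and the names `JudgeScore` and `route` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    REJECT = "reject"  # both outputs are clear failures; no preference recorded
    AUTO = "auto"      # the judge's preference is accepted as is
    HUMAN = "human"    # close, high-quality pair; sent to an expert rater


@dataclass
class JudgeScore:
    quality_a: float  # hypothetical aggregate judge score in [0, 1]
    quality_b: float


def route(score: JudgeScore, min_quality: float = 0.5, margin: float = 0.15) -> Route:
    """Two-gate routing with ASSUMED thresholds."""
    a_ok = score.quality_a >= min_quality
    b_ok = score.quality_b >= min_quality
    # Gate 1: filter clear failures.
    if not a_ok and not b_ok:
        return Route.REJECT
    if a_ok != b_ok:
        return Route.AUTO  # exactly one output passes, so the outcome is clear
    # Gate 2: both pass; decide automatically only when the gap is large.
    if abs(score.quality_a - score.quality_b) >= margin:
        return Route.AUTO
    return Route.HUMAN
```

Under this design, human effort is spent only on pairs where both outputs pass the quality gate and the judge's scores are close.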

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

CV-Arena introduces a large benchmark and hybrid evaluation protocol for instruction-guided image editing, but lacks the quantitative checks needed to confirm its quality.

The key takeaway is that this paper builds a benchmark for a broader kind of image editing task, one that includes physical and structural constraints, and pairs it with a new way to collect human preferences at scale using AI to handle the obvious cases.

What stands out as new is the CogRetriever pipeline for pulling real images and instructions from the web with agentic refinement, and the Active Elo system that uses a VLM called CV-Judge to gate which comparisons go to humans. They cover 16 task types and test 21 models, including their own CV-Agent that adds planning and verification steps. The paper does a solid job laying out why current benchmarks fall short for professional use and tries to make the data traceable.

The soft spots concern the lack of supporting numbers. The description of the curation does not include rejection rates from the verification step, measures of task diversity, or agreement between raters on the pairs. The concern that web search or the VLM verifier may introduce bias is not answered with data, so it is hard to know whether the failures in physical reasoning and detail preservation are widespread or tied to how the test cases were chosen. The central claims about model gaps rest on this unverified representativeness.

This paper is for researchers who build or evaluate models for instruction-based visual editing. A reader looking for a bigger and more varied test set than existing ones will get some value from the taxonomy and the protocol description.

It deserves a serious referee because the work is substantial enough in scope to benefit from external checks on the data construction and the Elo aggregation method.

I would recommend sending it for peer review.

Referee Report

2 major / 2 minor

Summary. The paper introduces CV-Arena, a benchmark of 12K high-resolution real-image instruction pairs spanning 16 task types for instructional computer vision problem solving (broader than narrow appearance edits, incorporating preservation, geometric, physical, and usability constraints). It describes the CogRetriever dual-track retrieval-and-curation pipeline (targeted web search, agentic refinement, verification, traceability), proposes Active Elo (human-AI collaborative preference protocol using CV-Judge, a logic-gated multi-dimensional VLM evaluator, to route comparisons and aggregate via reliability-weighted Elo), evaluates 21 proprietary/open-source/agentic systems revealing persistent gaps in instruction adherence, physical reasoning, structural control, and detail preservation, and presents CV-Agent (lightweight agentic model with planning/editing/verification) as a promising direction.

Significance. If the benchmark construction and evaluation protocol are validated as representative and reliable, the work supplies a large-scale, open resource that better matches professional image-editing workflows than prior narrow benchmarks; the human-AI hybrid protocol and explicit failure-mode taxonomy could accelerate progress on instruction-following visual models, while the agentic baseline offers a concrete starting point for closed-loop systems.

major comments (2)

[Abstract / CogRetriever pipeline description] Abstract and § on benchmark construction: the headline claim that evaluation of 21 systems 'reveals persistent gaps in instruction adherence, physical reasoning, structural control, and fine-grained detail preservation' is load-bearing for the paper's contribution, yet no per-task rejection statistics, source-domain entropy, inter-annotator agreement on curation decisions, or comparison against an external professional-editing corpus are supplied to establish that CogRetriever pairs are free of systematic bias or representative of professional workflows.
[Active Elo and CV-Judge description] Evaluation protocol section: the soundness of the reported gaps depends on Active Elo and CV-Judge; the manuscript supplies no quantitative validation of pair quality, inter-rater agreement statistics, or ablation studies isolating the contribution of the logic-gated VLM rejection step versus full human rating.

minor comments (2)

[Abstract] The abstract states '12K high-resolution real-image instruction pairs' but does not include a table or figure summarizing the distribution across the 16 task types; adding this would improve reproducibility and allow readers to assess coverage.
[Method overview] Notation for 'CV-Judge' and 'Active Elo' is introduced without an explicit forward reference to their formal definitions or pseudocode; a dedicated subsection or algorithm box would clarify the multi-dimensional scoring logic.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed report. The two major comments highlight important areas for strengthening the validation of both the benchmark construction and the evaluation protocol. We address each point below and commit to revisions that incorporate the requested quantitative evidence where feasible.

Referee: [Abstract / CogRetriever pipeline description] Abstract and § on benchmark construction: the headline claim that evaluation of 21 systems 'reveals persistent gaps in instruction adherence, physical reasoning, structural control, and fine-grained detail preservation' is load-bearing for the paper's contribution, yet no per-task rejection statistics, source-domain entropy, inter-annotator agreement on curation decisions, or comparison against an external professional-editing corpus are supplied to establish that CogRetriever pairs are free of systematic bias or representative of professional workflows.

Authors: We agree that these quantitative details are necessary to fully substantiate the representativeness claim. In the revised manuscript we will add: (i) per-task rejection statistics from the CogRetriever verification stage, (ii) source-domain entropy computed over the 12K pairs, and (iii) inter-annotator agreement (Cohen’s kappa) on the agentic refinement and human verification decisions. For the requested comparison against an external professional-editing corpus, no public corpus matching the 16-task breadth and real-image constraint set currently exists; we will instead provide a qualitative mapping of our task taxonomy to documented professional workflows (e.g., from Adobe, Figma, and photography post-production literature) and report how the 16 categories were derived from those sources. revision: yes
Referee: [Active Elo and CV-Judge description] Evaluation protocol section: the soundness of the reported gaps depends on Active Elo and CV-Judge; the manuscript supplies no quantitative validation of pair quality, inter-rater agreement statistics, or ablation studies isolating the contribution of the logic-gated VLM rejection step versus full human rating.

Authors: We concur that empirical validation of the hybrid protocol is essential. The revised version will include: (i) inter-rater agreement (Fleiss’ kappa) between expert human raters and between humans and CV-Judge on a held-out subset of 500 pairs, (ii) pair-quality metrics (e.g., consistency of preference outcomes across repeated judgments), and (iii) an ablation comparing full-human rating against the logic-gated CV-Judge routing in terms of both agreement with final Elo rankings and annotation cost. These additions will directly quantify the reliability of the reported model gaps. revision: yes
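
Two of the statistics promised in the rebuttal, source-domain entropy and Cohen's kappa, have standard definitions that can be sketched directly. The snippet below is illustrative only: the function names and the toy labels are made up, it covers the two-rater case (Cohen), and it does not implement the multi-rater Fleiss' kappa mentioned in the second response.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy in bits of the empirical label distribution."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from the two raters' marginal label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: both raters always use one identical label
    return (observed - expected) / (1.0 - expected)


# Toy inputs (made up): image source domains and keep/drop curation decisions.
domains = ["stock", "news", "social", "archive"]
rater_1 = ["keep", "keep", "drop", "drop"]
rater_2 = ["keep", "drop", "drop", "drop"]

domain_entropy = shannon_entropy(domains)
agreement = cohens_kappa(rater_1, rater_2)
```

Higher entropy indicates pairs spread more evenly across sources; kappa near 1 indicates agreement well above chance, while values near 0 indicate agreement no better than chance.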

Circularity Check

0 steps flagged

No circularity: benchmark construction and empirical evaluation are self-contained

The paper introduces CV-Arena as a new benchmark via the CogRetriever pipeline and evaluates 21 systems using Active Elo with CV-Judge; no mathematical derivations, parameter fittings presented as predictions, or self-citation chains appear in the central claims. The construction pipeline and evaluation protocol are described directly without reducing any result to its own inputs by definition. The reader's assessment of score 1.0 aligns with the absence of any load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 3 invented entities

The central claims rest on the assumption that the new curation and evaluation constructs are valid without external validation data shown in the abstract; no free parameters or mathematical axioms are invoked.

invented entities (3)

CV-Judge · no independent evidence
purpose: logic-gated multi-dimensional VLM evaluator for routing comparisons
Introduced as a component of Active Elo; no independent evidence supplied.

Active Elo · no independent evidence
purpose: human-AI collaborative preference aggregation protocol
New evaluation method described in abstract; no prior validation cited.

CV-Agent · no independent evidence
purpose: lightweight agentic model combining planning, editing, and verification
Proposed system whose performance is reported but not independently verified.

Reference graph

Works this paper leans on

90 extracted references · 30 canonical work pages · 14 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Rank analysis of incomplete block designs: I

Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons.Biometrika, 39(3/4):324–345, 1952

1952
[3]

Tim Brooks, Aleksander Holynski, and Alexei A. Efros. Instructpix2pix: Learning to follow image editing instructions.2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18392–18402, 2022

2023
[4]

Instruction-based image manipulation by watching how things move

Mingdeng Cao, Xuaner Zhang, Yinqiang Zheng, and Zhihao Xia. Instruction-based image manipulation by watching how things move. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 2704–2713, 2025

2025
[6]

Bytemorph: Benchmarking instruction-guided image editing with non-rigid motions.arXiv preprint arXiv:2506.03107, 2025

Di Chang, Mingdeng Cao, Yichun Shi, Bo Liu, Shengqu Cai, Shijie Zhou, Weilin Huang, Gordon Wetzstein, Mohammad Soleymani, and Peng Wang. Bytemorph: Benchmarking instruction-guided image editing with non-rigid motions.arXiv preprint arXiv:2506.03107, 2025

work page arXiv 2025
[7]

Chen, Misha Sra, and Pradeep Sen

Sherry X. Chen, Misha Sra, and Pradeep Sen. Instruct-clip: Improving instruction-guided image editing with automated data refinement using contrastive learning, 2025

2025
[8]

Hierarchical integration diffusion model for realistic image deblurring.Advances in neural information processing systems, 36:29114–29125, 2023

Zheng Chen, Yulun Zhang, Ding Liu, Jinjin Gu, Linghe Kong, Xin Yuan, et al. Hierarchical integration diffusion model for realistic image deblurring.Advances in neural information processing systems, 36:29114–29125, 2023

2023
[9]

Opengpt-4o-image: A comprehensive dataset for advanced image generation and editing.arXiv preprint arXiv:2509.24900, 2025

Zhihong Chen, Xuehai Bai, Yang Shi, Chaoyou Fu, Huanyu Zhang, Haotian Wang, Xiaoyan Sun, Zhang Zhang, Liang Wang, Yuanxing Zhang, et al. Opengpt-4o-image: A comprehensive dataset for advanced image generation and editing.arXiv preprint arXiv:2509.24900, 2025

work page arXiv 2025
[10]

Exploring the naturalness of ai-generated images.arXiv preprint arXiv:2312.05476, 2023

Zijian Chen, Wei Sun, Haoning Wu, Zicheng Zhang, Jun Jia, Zhongpeng Ji, Fengyu Sun, Shangling Jui, Xiongkuo Min, Guangtao Zhai, et al. Exploring the naturalness of ai-generated images.arXiv preprint arXiv:2312.05476, 2023

work page arXiv 2023
[11]

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E Gonzalez, et al. Chatbot arena: An open platform for evaluating llms by human preference.arXiv preprint arXiv:2403.04132, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Image harmonization dataset iharmony4: Hcoco, hadobe5k, hflickr, and hday2night.arXiv preprint arXiv:1908.10526, 2019

Wenyan Cong, Jianfu Zhang, Li Niu, Liu Liu, Zhixin Ling, Weiyuan Li, and Liqing Zhang. Image harmonization dataset iharmony4: Hcoco, hadobe5k, hflickr, and hday2night.arXiv preprint arXiv:1908.10526, 2019. 10

work page arXiv 1908
[14]

Introducing gemini 2.5 flash image, our state-of-the-art image model, 2025

Google DeepMind. Introducing gemini 2.5 flash image, our state-of-the-art image model, 2025

2025
[15]

Introducing nano banana pro, 2025

Google DeepMind. Introducing nano banana pro, 2025

2025
[16]

Imagenet: A large- scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large- scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009

2009
[17]

Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators.arXiv preprint arXiv:2404.04475, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

Alpacafarm: A simulation framework for methods that learn from human feedback.Advances in Neural Information Processing Systems, 36:30039–30069, 2023

Yann Dubois, Chen Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy S Liang, and Tatsunori B Hashimoto. Alpacafarm: A simulation framework for methods that learn from human feedback.Advances in Neural Information Processing Systems, 36:30039–30069, 2023

2023
[19]

Image classification based on cnn: a survey.Journal of Cybersecurity and Information Management, 6(1):18–50, 2021

Ahmed A Elngar, Mohamed Arafa, Amar Fathy, Basma Moustafa, Omar Mahmoud, Mohamed Shaban, and Nehal Fawzy. Image classification based on cnn: a survey.Journal of Cybersecurity and Information Management, 6(1):18–50, 2021

2021
[20]

Recognition-synergistic scene text editing

Zhengyao Fang, Pengyuan Lyu, Jingjing Wu, Chengquan Zhang, Jun Yu, Guangming Lu, and Wenjie Pei. Recognition-synergistic scene text editing. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 13104–13113, 2025

2025
[21]

prentice hall professional technical reference, 2002

David A Forsyth and Jean Ponce.Computer vision: a modern approach. prentice hall professional technical reference, 2002

2002
[22]

Seed-data-edit technical report: A hybrid dataset for instructional image editing.arXiv preprint arXiv:2405.04007, 2024

Yuying Ge, Sijie Zhao, Chen Li, Yixiao Ge, and Ying Shan. Seed-data-edit technical report: A hybrid dataset for instructional image editing.arXiv preprint arXiv:2405.04007, 2024

work page arXiv 2024
[23]

Gemini 2.5 pro model card, 2025

Google. Gemini 2.5 pro model card, 2025. Accessed: 2026-01-11

2025
[24]

Multi- reward as condition for instruction-based image editing

Xin Gu, Ming Li, Libo Zhang, Fan Chen, Longyin Wen, Tiejian Luo, and Sijie Zhu. Multi- reward as condition for instruction-based image editing. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[25]

Improving visual and downstream performance of low-light enhancer with vision foun- dation models collaboration

Yuxuan Gu, Haoxuan Wang, Pengyang Ling, Zhixiang Wei, Huaian Chen, Yi Jin, and Enhong Chen. Improving visual and downstream performance of low-light enhancer with vision foun- dation models collaboration. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 16071–16080, 2025

2025
[26]

Large Language Model based Multi-Agents: A Survey of Progress and Challenges

Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V Chawla, Olaf Wiest, and Xiangliang Zhang. Large language model based multi-agents: A survey of progress and challenges.arXiv preprint arXiv:2402.01680, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Clip and complementary methods.Nature Reviews Methods Primers, 1(1):20, 2021

Markus Hafner, Maria Katsantoni, Tino Köster, James Marks, Joyita Mukherjee, Dorothee Staiger, Jernej Ule, and Mihaela Zavolan. Clip and complementary methods.Nature Reviews Methods Primers, 1(1):20, 2021

2021
[28]

Clipscore: A reference-free evaluation metric for image captioning

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. InProceedings of the 2021 conference on empirical methods in natural language processing, pages 7514–7528, 2021

2021
[29]

Hq-edit: A high-quality dataset for instruction-based image editing.arXiv preprint arXiv:2404.09990, 2024

Mude Hui, Siwei Yang, Bingchen Zhao, Yichun Shi, Heng Wang, Peng Wang, Yuyin Zhou, and Cihang Xie. Hq-edit: A high-quality dataset for instruction-based image editing.arXiv preprint arXiv:2404.09990, 2024

work page arXiv 2024
[30]

Grounding degradations in natural language for all-in-one video restoration

Muhammad Kamran Janjua, Amirhosein Ghasemabadi, Kunlin Zhang, Mohammad Salameh, Chao Gao, and Di Niu. Grounding degradations in natural language for all-in-one video restoration. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5734–5743, 2026. 11

2026
[31]

Compbench: Benchmarking complex instruction-guided image editing.arXiv preprint arXiv:2505.12200, 2025

Bohan Jia, Wenxuan Huang, Yuntian Tang, Junbo Qiao, Jincheng Liao, Shaosheng Cao, Fei Zhao, Zhaopeng Feng, Zhouhong Gu, Zhenfei Yin, et al. Compbench: Benchmarking complex instruction-guided image editing.arXiv preprint arXiv:2505.12200, 2025

work page arXiv 2025
[32]

Genai arena: An open evaluation platform for generative models.Advances in Neural Information Processing Systems, 37:79889–79908, 2024

Dongfu Jiang, Max Ku, Tianle Li, Yuansheng Ni, Shizhuo Sun, Rongqi Fan, and Wenhu Chen. Genai arena: An open evaluation platform for generative models.Advances in Neural Information Processing Systems, 37:79889–79908, 2024

2024
[33]

A new measure of rank correlation.Biometrika, 30(1-2):81–93, 1938

Maurice G Kendall. A new measure of rank correlation.Biometrika, 30(1-2):81–93, 1938

1938
[34]

Orida: Object-centric real-world image composition dataset

Jinwoo Kim, Sangmin Han, Jinho Jeong, Jiwoo Choi, Dongyeoung Kim, and Seon Joo Kim. Orida: Object-centric real-world image composition dataset. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 3051–3060, 2025

2025
[35]

Peak signal-to-noise ratio revisited: Is simple beautiful? In 2012 Fourth International Workshop on Quality of Multimedia Experience, pages 37–38, 2012

Jari Korhonen and Junyong You. Peak signal-to-noise ratio revisited: Is simple beautiful? In 2012 Fourth International Workshop on Quality of Multimedia Experience, pages 37–38, 2012

2012
[36]

Imagenhub: Standardizing the evaluation of conditional image generation models.arXiv preprint arXiv:2310.01596, 2023

Max Ku, Tianle Li, Kai Zhang, Yujie Lu, Xingyu Fu, Wenwen Zhuang, and Wenhu Chen. Imagenhub: Standardizing the evaluation of conditional image generation models.arXiv preprint arXiv:2310.01596, 2023

work page arXiv 2023
[37]

FLUX.2: Frontier Visual Intelligence

Black Forest Labs. FLUX.2: Frontier Visual Intelligence. https://bfl.ai/blog/flux-2, 2025

2025
[38]

The mnist database of handwritten digits.http://yann

Yann LeCun. The mnist database of handwritten digits.http://yann. lecun. com/exdb/mnist/, 1998

1998
[39]

Superedit: Rectifying and facilitating supervision for instruction-based image editing

Ming Li, Xin Gu, Fan Chen, Xiaoying Xing, Longyin Wen, Chen Chen, and Sijie Zhu. Superedit: Rectifying and facilitating supervision for instruction-based image editing. 2025

2025
[40]

Towards benchmarking and assessing visual naturalness of physical world adversarial attacks

Simin Li, Shuning Zhang, Gujun Chen, Dong Wang, Pu Feng, Jiakai Wang, Aishan Liu, Xin Yi, and Xianglong Liu. Towards benchmarking and assessing visual naturalness of physical world adversarial attacks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12324–12333, 2023

2023
[41]

UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, et al. Uniworld: High-resolution semantic encoders for unified visual understanding and generation.arXiv preprint arXiv:2506.03147, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[42]

Jarvisevo: Towards a self-evolving photo editing agent with synergistic editor-evaluator optimization.arXiv preprint arXiv:2511.23002, 2025

Yunlong Lin, Linqing Wang, Kunjie Lin, Zixu Lin, Kaixiong Gong, Wenbo Li, Bin Lin, Zhenxi Li, Shiyi Zhang, Yuyang Peng, et al. Jarvisevo: Towards a self-evolving photo editing agent with synergistic editor-evaluator optimization.arXiv preprint arXiv:2511.23002, 2025

work page arXiv 2025
[43]

Referring image editing: Object-level image editing via referring expressions

Chang Liu, Xiangtai Li, and Henghui Ding. Referring image editing: Object-level image editing via referring expressions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13128–13138, 2024

2024
[44]

Step1X-Edit: A Practical Framework for General Image Editing

Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, Guopeng Li, Yuang Peng, Quan Sun, Jingwei Wu, Yan Cai, Zheng Ge, Ranchen Ming, Lei Xia, Xianfang Zeng, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Gang Yu, and Daxin Jiang. Step1x-edit: A practical framework for general image editing. arXiv pre...

2025
[45]

DeepSeek-VL: Towards Real-World Vision-Language Understanding

Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. Deepseek-vl: towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525, 2024

2024
[46]

Visual-instructed degradation diffusion for all-in-one image restoration

Wenyang Luo, Haina Qin, Zewen Chen, Libin Wang, Dandan Zheng, Yuming Li, Yufan Liu, Bing Li, and Weiming Hu. Visual-instructed degradation diffusion for all-in-one image restoration. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12764–12777, 2025

2025
[47]

Visual autoregressive modeling for instruction-guided image editing

Qingyang Mao, Qi Cai, Yehao Li, Yingwei Pan, Mingyue Cheng, Ting Yao, Qi Liu, and Tao Mei. Visual autoregressive modeling for instruction-guided image editing. arXiv preprint, 2025

2025
[48]

Introducing manus 1.6: Max performance, mobile dev, and design view

Manus Meta. Introducing manus 1.6: Max performance, mobile dev, and design view, 2025

2025
[49]

Image segmentation using deep learning: A survey

Shervin Minaee, Yuri Boykov, Fatih Porikli, Antonio Plaza, Nasser Kehtarnavaz, and Demetri Terzopoulos. Image segmentation using deep learning: A survey. IEEE transactions on pattern analysis and machine intelligence, 44(7):3523–3542, 2021

2021
[50]

A comprehensive overview of large language models

Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, and Ajmal Mian. A comprehensive overview of large language models. ACM Transactions on Intelligent Systems and Technology, 16(5):1–72, 2025

2025
[51]

Gpt image 1

OpenAI. Gpt image 1, 2025

2025
[52]

Introducing chatgpt agent: bridging research and action

OpenAI. Introducing chatgpt agent: bridging research and action, 2025. Accessed: 2026-01-26

2025
[53]

The new chatgpt images is here

OpenAI. The new chatgpt images is here, 2025

2025
[54]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

2022
[55]

The claude 3 model family: Opus, sonnet, haiku

Anthropic PBC. The claude 3 model family: Opus, sonnet, haiku
[56]

Dreambench++: A human-aligned benchmark for personalized image generation

Yuang Peng, Yuxin Cui, Haomiao Tang, Zekun Qi, Runpei Dong, Jing Bai, Chunrui Han, Zheng Ge, Xiangyu Zhang, and Shu-Tao Xia. Dreambench++: A human-aligned benchmark for personalized image generation. arXiv preprint arXiv:2406.16855, 2024

2024
[57]

Pico-banana-400k: A large-scale dataset for text-guided image editing

Yusu Qian, Eli Bocek-Rivele, Liangchen Song, Jialing Tong, Yinfei Yang, Jiasen Lu, Wenze Hu, and Zhe Gan. Pico-banana-400k: A large-scale dataset for text-guided image editing. arXiv preprint arXiv:2510.19808, 2025

2025
[58]

Exploring stroke-level modifications for scene text editing

Yadong Qu, Qingfeng Tan, Hongtao Xie, Jianjun Xu, Yuxin Wang, and Yongdong Zhang. Exploring stroke-level modifications for scene text editing. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 2119–2127, 2023

2023
[59]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

2021
[60]

Seedream 4.0: Toward Next-generation Multimodal Image Generation

Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next-generation multimodal image generation. arXiv preprint arXiv:2509.20427, 2025

2025
[61]

Insert anything: Image insertion via in-context editing in dit

Wensong Song, Hong Jiang, Zongxin Yang, Zheqiao Cheng, Ruijie Quan, and Yi Yang. Insert anything: Image insertion via in-context editing in dit. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 9097–9105, 2026

2026
[62]

Computer vision: algorithms and applications

Richard Szeliski. Computer vision: algorithms and applications. Springer Nature, 2022

2022
[63]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

2023
[64]

Add-it: Training-free object insertion in images with pretrained diffusion models

Yoad Tewel, Rinon Gal, Dvir Samuel, Yuval Atzmon, Lior Wolf, and Gal Chechik. Add-it: Training-free object insertion in images with pretrained diffusion models. arXiv preprint arXiv:2411.07232, 2024

2024
[65]

What makes an image realistic?

Lucas Theis. What makes an image realistic? arXiv preprint arXiv:2403.04493, 2024

2024
[66]

Deep learning for computer vision: A brief review

Athanasios Voulodimos, Nikolaos Doulamis, Anastasios Doulamis, and Eftychios Protopapadakis. Deep learning for computer vision: A brief review. Computational intelligence and neuroscience, 2018(1):7068349, 2018

2018
[67]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

2025
[68]

Adapting text-to-image generation with feature difference instruction for generic image restoration

Chao Wang, Hehe Fan, Huichen Yang, Sarvnaz Karimi, Lina Yao, and Yi Yang. Adapting text-to-image generation with feature difference instruction for generic image restoration. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23539–23550, 2025

2025
[69]

Complexbench-edit: Benchmarking complex instruction-driven image editing via compositional dependencies

Chenglin Wang, Yucheng Zhou, Qianning Wang, Zhe Wang, and Kai Zhang. Complexbench-edit: Benchmarking complex instruction-driven image editing via compositional dependencies. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 13391–13397, 2025

2025
[70]

Vision-zero: Scalable vlm self-improvement via strategic gamified self-play

Qinsi Wang, Bo Liu, Tianyi Zhou, Jing Shi, Yueqian Lin, Yiran Chen, Hai Helen Li, Kun Wan, and Wentian Zhao. Vision-zero: Scalable vlm self-improvement via strategic gamified self-play. arXiv preprint arXiv:2509.25541, 2025

2025
[71]

Rl-vlm-f: Reinforcement learning from vision language foundation model feedback

Yufei Wang, Zhanyi Sun, Jesse Zhang, Zhou Xian, Erdem Biyik, David Held, and Zackory Erickson. Rl-vlm-f: Reinforcement learning from vision language foundation model feedback. arXiv preprint arXiv:2402.03681, 2024

2024
[72]

Image quality assessment: from error visibility to structural similarity

Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004

2004
[73]

Objectdrop: Bootstrapping counterfactuals for photorealistic object removal and insertion

Daniel Winter, Matan Cohen, Shlomi Fruchter, Yael Pritch, Alex Rav-Acha, and Yedid Hoshen. Objectdrop: Bootstrapping counterfactuals for photorealistic object removal and insertion. In European Conference on Computer Vision, pages 112–129. Springer, 2024

2024
[74]

Qwen-image technical report

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun...

2025
[75]

Editreward: A human-aligned reward model for instruction-guided image editing

Keming Wu, Sicong Jiang, Max Ku, Ping Nie, Minghao Liu, and Wenhu Chen. Editreward: A human-aligned reward model for instruction-guided image editing. arXiv preprint arXiv:2509.26346, 2025

2025
[76]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations, 2022

2022
[77]

ImgEdit: A Unified Image Editing Dataset and Benchmark

Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, and Li Yuan. Imgedit: A unified image editing dataset and benchmark. arXiv preprint arXiv:2505.20275, 2025

2025
[78]

Anyedit: Mastering unified high-quality image editing for any idea

Qifan Yu, Wei Chow, Zhongqi Yue, Kaihang Pan, Yang Wu, Xiaoyang Wan, Juncheng Li, Siliang Tang, Hanwang Zhang, and Yueting Zhuang. Anyedit: Mastering unified high-quality image editing for any idea. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26125–26135, 2025

2025
[79]

DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection

Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M Ni, and Heung-Yeung Shum. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605, 2022

2022
[80]

Vision-language models for vision tasks: A survey

Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey. IEEE transactions on pattern analysis and machine intelligence, 46(8):5625–5644, 2024

2024
[81]

Magicbrush: A manually annotated dataset for instruction-guided image editing

Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. Magicbrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Processing Systems, 36:31428–31449, 2023

2023

[37] [38]

The mnist database of handwritten digits.http://yann

Yann LeCun. The mnist database of handwritten digits.http://yann. lecun. com/exdb/mnist/, 1998

1998

[38] [39]

Superedit: Rectifying and facilitating supervision for instruction-based image editing

Ming Li, Xin Gu, Fan Chen, Xiaoying Xing, Longyin Wen, Chen Chen, and Sijie Zhu. Superedit: Rectifying and facilitating supervision for instruction-based image editing. 2025

2025

[39] [40]

Towards benchmarking and assessing visual naturalness of physical world adversarial attacks

Simin Li, Shuning Zhang, Gujun Chen, Dong Wang, Pu Feng, Jiakai Wang, Aishan Liu, Xin Yi, and Xianglong Liu. Towards benchmarking and assessing visual naturalness of physical world adversarial attacks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12324–12333, 2023

2023

[40] [41]

UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, et al. Uniworld: High-resolution semantic encoders for unified visual understanding and generation.arXiv preprint arXiv:2506.03147, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[41] [42]

Jarvisevo: Towards a self-evolving photo editing agent with synergistic editor-evaluator optimization.arXiv preprint arXiv:2511.23002, 2025

Yunlong Lin, Linqing Wang, Kunjie Lin, Zixu Lin, Kaixiong Gong, Wenbo Li, Bin Lin, Zhenxi Li, Shiyi Zhang, Yuyang Peng, et al. Jarvisevo: Towards a self-evolving photo editing agent with synergistic editor-evaluator optimization.arXiv preprint arXiv:2511.23002, 2025

work page arXiv 2025

[42] [43]

Referring image editing: Object-level image editing via referring expressions

Chang Liu, Xiangtai Li, and Henghui Ding. Referring image editing: Object-level image editing via referring expressions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13128–13138, 2024

2024

[43] [44]

Step1X-Edit: A Practical Framework for General Image Editing

Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, Guopeng Li, Yuang Peng, Quan Sun, Jingwei Wu, Yan Cai, Zheng Ge, Ranchen Ming, Lei Xia, Xianfang Zeng, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Gang Yu, and Daxin Jiang. Step1x-edit: A practical framework for general image editing.arXiv pre...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[44] [45]

DeepSeek-VL: Towards Real-World Vision-Language Understanding

Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. Deepseek-vl: towards real-world vision-language under- standing.arXiv preprint arXiv:2403.05525, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[45] [46]

Visual-instructed degradation diffusion for all-in-one image restoration

Wenyang Luo, Haina Qin, Zewen Chen, Libin Wang, Dandan Zheng, Yuming Li, Yufan Liu, Bing Li, and Weiming Hu. Visual-instructed degradation diffusion for all-in-one image restoration. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 12764–12777, 2025

2025

[46] [47]

Visual autoregressive modeling for instruction-guided image editing.arXiv preprint, 2025

Qingyang Mao, Qi Cai, Yehao Li, Yingwei Pan, Mingyue Cheng, Ting Yao, Qi Liu, and Tao Mei. Visual autoregressive modeling for instruction-guided image editing.arXiv preprint, 2025. 12

2025

[47] [48]

Introducing manus 1.6: Max performance, mobile dev, and design view, 2025

Manus Meta. Introducing manus 1.6: Max performance, mobile dev, and design view, 2025

2025

[48] [49]

Image segmentation using deep learning: A survey.IEEE transactions on pattern analysis and machine intelligence, 44(7):3523–3542, 2021

Shervin Minaee, Yuri Boykov, Fatih Porikli, Antonio Plaza, Nasser Kehtarnavaz, and Demetri Terzopoulos. Image segmentation using deep learning: A survey.IEEE transactions on pattern analysis and machine intelligence, 44(7):3523–3542, 2021

2021

[49] [50]

A comprehensive overview of large language models.ACM Transactions on Intelligent Systems and Technology, 16(5):1–72, 2025

Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, and Ajmal Mian. A comprehensive overview of large language models.ACM Transactions on Intelligent Systems and Technology, 16(5):1–72, 2025

2025

[50] [51]

Gpt image 1, 2025

OpenAI. Gpt image 1, 2025

2025

[51] [52]

Introducing chatgpt agent: bridging research and action, 2025

OpenAI. Introducing chatgpt agent: bridging research and action, 2025. Accessed: 2026-01-26

2025

[52] [53]

The new chatgpt images is here, 2025

OpenAI. The new chatgpt images is here, 2025

2025

[53] [54]

Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

2022

[54] [55]

The claude 3 model family: Opus, sonnet, haiku

Anthropic PBC. The claude 3 model family: Opus, sonnet, haiku

[55] [56]

Dreambench++: A human-aligned benchmark for personalized image generation.arXiv preprint arXiv:2406.16855, 2024

Yuang Peng, Yuxin Cui, Haomiao Tang, Zekun Qi, Runpei Dong, Jing Bai, Chunrui Han, Zheng Ge, Xiangyu Zhang, and Shu-Tao Xia. Dreambench++: A human-aligned benchmark for personalized image generation.arXiv preprint arXiv:2406.16855, 2024

work page arXiv 2024

[56] [57]

Pico-banana-400k: A large-scale dataset for text-guided image editing.arXiv preprint arXiv:2510.19808, 2025

Yusu Qian, Eli Bocek-Rivele, Liangchen Song, Jialing Tong, Yinfei Yang, Jiasen Lu, Wenze Hu, and Zhe Gan. Pico-banana-400k: A large-scale dataset for text-guided image editing.arXiv preprint arXiv:2510.19808, 2025

work page arXiv 2025

[57] [58]

Ex- ploring stroke-level modifications for scene text editing

Yadong Qu, Qingfeng Tan, Hongtao Xie, Jianjun Xu, Yuxin Wang, and Yongdong Zhang. Ex- ploring stroke-level modifications for scene text editing. InProceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 2119–2127, 2023

2023

[58] [59]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

2021

[59] [60]

Seedream 4.0: Toward Next-generation Multimodal Image Generation

Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next-generation multimodal image generation.arXiv preprint arXiv:2509.20427, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[60] [61]

Insert anything: Image insertion via in-context editing in dit

Wensong Song, Hong Jiang, Zongxin Yang, Zheqiao Cheng, Ruijie Quan, and Yi Yang. Insert anything: Image insertion via in-context editing in dit. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 9097–9105, 2026

2026

[61] [62]

Springer Nature, 2022

Richard Szeliski.Computer vision: algorithms and applications. Springer Nature, 2022

2022

[62] [63]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[63] [64]

Add- it: Training-free object insertion in images with pretrained diffusion models.arXiv preprint arXiv:2411.07232, 2024

Yoad Tewel, Rinon Gal, Dvir Samuel, Yuval Atzmon, Lior Wolf, and Gal Chechik. Add- it: Training-free object insertion in images with pretrained diffusion models.arXiv preprint arXiv:2411.07232, 2024

work page arXiv 2024

[64] [65]

What makes an image realistic?

Lucas Theis. What makes an image realistic? arXiv preprint arXiv:2403.04493, 2024

2024

[65] [66]

Deep learning for computer vision: A brief review

Athanasios Voulodimos, Nikolaos Doulamis, Anastasios Doulamis, and Eftychios Protopapadakis. Deep learning for computer vision: A brief review. Computational intelligence and neuroscience, 2018(1):7068349, 2018

2018

[66] [67]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

2025

[67] [68]

Adapting text-to-image generation with feature difference instruction for generic image restoration

Chao Wang, Hehe Fan, Huichen Yang, Sarvnaz Karimi, Lina Yao, and Yi Yang. Adapting text-to-image generation with feature difference instruction for generic image restoration. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23539–23550, 2025

2025

[68] [69]

Complexbench-edit: Benchmarking complex instruction-driven image editing via compositional dependencies

Chenglin Wang, Yucheng Zhou, Qianning Wang, Zhe Wang, and Kai Zhang. Complexbench-edit: Benchmarking complex instruction-driven image editing via compositional dependencies. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 13391–13397, 2025

2025

[69] [70]

Vision-zero: Scalable vlm self-improvement via strategic gamified self-play

Qinsi Wang, Bo Liu, Tianyi Zhou, Jing Shi, Yueqian Lin, Yiran Chen, Hai Helen Li, Kun Wan, and Wentian Zhao. Vision-zero: Scalable vlm self-improvement via strategic gamified self-play. arXiv preprint arXiv:2509.25541, 2025

2025

[70] [71]

Rl-vlm-f: Reinforcement learning from vision language foundation model feedback

Yufei Wang, Zhanyi Sun, Jesse Zhang, Zhou Xian, Erdem Biyik, David Held, and Zackory Erickson. Rl-vlm-f: Reinforcement learning from vision language foundation model feedback. arXiv preprint arXiv:2402.03681, 2024

2024

[71] [72]

Image quality assessment: from error visibility to structural similarity

Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004

2004

[72] [73]

Objectdrop: Bootstrapping counterfactuals for photorealistic object removal and insertion

Daniel Winter, Matan Cohen, Shlomi Fruchter, Yael Pritch, Alex Rav-Acha, and Yedid Hoshen. Objectdrop: Bootstrapping counterfactuals for photorealistic object removal and insertion. In European Conference on Computer Vision, pages 112–129. Springer, 2024

2024

[73] [74]

Qwen-image technical report

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun...

2025

[74] [75]

Editreward: A human-aligned reward model for instruction-guided image editing

Keming Wu, Sicong Jiang, Max Ku, Ping Nie, Minghao Liu, and Wenhu Chen. Editreward: A human-aligned reward model for instruction-guided image editing. arXiv preprint arXiv:2509.26346, 2025

2025

[75] [76]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations, 2022

2022

[76] [77]

ImgEdit: A Unified Image Editing Dataset and Benchmark

Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, and Li Yuan. Imgedit: A unified image editing dataset and benchmark. arXiv preprint arXiv:2505.20275, 2025

2025

[77] [78]

Anyedit: Mastering unified high-quality image editing for any idea

Qifan Yu, Wei Chow, Zhongqi Yue, Kaihang Pan, Yang Wu, Xiaoyang Wan, Juncheng Li, Siliang Tang, Hanwang Zhang, and Yueting Zhuang. Anyedit: Mastering unified high-quality image editing for any idea. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26125–26135, 2025

2025

[78] [79]

DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection

Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M Ni, and Heung-Yeung Shum. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605, 2022

2022

[79] [80]

Vision-language models for vision tasks: A survey

Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey. IEEE transactions on pattern analysis and machine intelligence, 46(8):5625–5644, 2024

2024

[80] [81]

Magicbrush: A manually annotated dataset for instruction-guided image editing

Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. Magicbrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Processing Systems, 36:31428–31449, 2023

2023