Pith · machine review for the scientific record

arxiv: 2504.17761 · v5 · submitted 2025-04-24 · 💻 cs.CV

Recognition: 3 Lean theorem links

Step1X-Edit: A Practical Framework for General Image Editing

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 14:31 UTC · model grok-4.3

classification 💻 cs.CV
keywords: image editing · multimodal LLM · diffusion decoder · open-source model · GEdit-Bench · data generation pipeline · image manipulation · generative models

The pith

Step1X-Edit is an open-source image editing model that approaches the performance of closed-source systems such as GPT-4o on real-world tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Step1X-Edit as a practical open-source framework for general image editing. It processes a reference image together with a natural-language editing instruction through a multimodal large language model, extracts a latent embedding from that combination, and feeds the embedding into a diffusion decoder to produce the edited output. A custom data generation pipeline supplies the training examples, while GEdit-Bench supplies an evaluation set drawn from actual user requests. Experiments show the resulting model exceeds other open-source editors by a wide margin and reaches levels close to leading proprietary systems.

Core claim

Step1X-Edit combines a multimodal LLM with a diffusion image decoder to perform general-purpose image editing. The model is trained on data produced by a dedicated generation pipeline and evaluated on GEdit-Bench, a benchmark constructed from real-world user instructions. On this benchmark the system substantially outperforms existing open-source baselines and approaches the editing quality of closed-source models such as GPT-4o and Gemini 2.0 Flash.

What carries the argument

Multimodal LLM that ingests the reference image and editing instruction to produce a latent embedding, which is then passed to a diffusion image decoder for final output generation.
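As a rough, non-authoritative sketch of that flow, the outline below shows how an MLLM's hidden states for an (image, instruction) pair might be projected into a conditioning embedding for a diffusion decoder. The module names, dimensions, projection layer, and sampling call are illustrative assumptions, not the released Step1X-Edit implementation.

```python
# Hypothetical sketch of the MLLM -> latent embedding -> diffusion decoder flow
# the paper describes. Module names, dimensions, and the projection layer are
# illustrative assumptions, not the released Step1X-Edit code.
import torch
import torch.nn as nn

class EditConditioner(nn.Module):
    """Projects MLLM hidden states for (reference image, instruction) into a
    conditioning embedding for a diffusion image decoder."""

    def __init__(self, mllm_dim: int = 4096, cond_dim: int = 1024):
        super().__init__()
        self.proj = nn.Linear(mllm_dim, cond_dim)

    def forward(self, mllm_hidden: torch.Tensor) -> torch.Tensor:
        # mllm_hidden: (batch, seq_len, mllm_dim) hidden states from the MLLM
        return self.proj(mllm_hidden)  # (batch, seq_len, cond_dim)

def edit_image(mllm, conditioner, diffusion_decoder, ref_image, instruction, steps=30):
    """One editing pass; mllm and diffusion_decoder are placeholder callables."""
    with torch.no_grad():
        hidden = mllm(images=ref_image, text=instruction)  # (B, T, D) hidden states
        cond = conditioner(hidden)                          # latent editing embedding
        # The decoder denoises under the conditioning embedding; how the reference
        # image latent is coupled in is left abstract here.
        return diffusion_decoder.sample(condition=cond, init_image=ref_image, steps=steps)
```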

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same LLM-plus-diffusion pattern could be adapted to video or 3-D asset editing by swapping the decoder backbone.
  • Open release of both model and benchmark may encourage community fine-tuning on domain-specific editing tasks such as product photography or medical imagery.
  • The data pipeline itself offers a template for synthesizing large-scale instruction-following datasets without manual labeling.

Load-bearing premise

The data generation pipeline produces high-quality and diverse examples that let the model generalize to arbitrary real-world instructions, and GEdit-Bench accurately reflects practical editing needs.
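To make that premise concrete, a minimal hypothetical synthesis loop of the kind such a pipeline could run is sketched below; generate_instruction, apply_edit, quality_score, and the filtering threshold are invented placeholders, not the paper's actual pipeline components.

```python
# Hypothetical instruction-editing data synthesis loop. generate_instruction,
# apply_edit, and quality_score stand in for whatever proposer, editor, and
# filter a pipeline like the paper's might use; the threshold is invented.
def build_dataset(source_images, proposals_per_image=3, quality_threshold=0.8):
    dataset = []
    for img in source_images:
        for _ in range(proposals_per_image):
            instr = generate_instruction(img)           # e.g. an LLM proposes an edit
            edited = apply_edit(img, instr)             # e.g. a specialist editing model
            score = quality_score(img, instr, edited)   # e.g. a VLM judge or heuristic
            if score >= quality_threshold:              # keep only high-quality triples
                dataset.append({"source": img, "instruction": instr, "target": edited})
    return dataset
```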

What would settle it

A comparison on a fresh collection of user instructions outside GEdit-Bench in which Step1X-Edit falls markedly below GPT-4o quality would falsify the claim of approaching proprietary performance.
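One hedged way such a test could be scored is sketched below; the judge interface and the win-rate reading are illustrative assumptions rather than GEdit-Bench's actual protocol.

```python
# Hypothetical head-to-head scoring on instructions collected outside GEdit-Bench.
# vlm_judge is a placeholder for any preference judge (human raters or a VLM);
# reading the result against parity is an illustrative decision rule, not the paper's.
def head_to_head(instructions, images, model_a, model_b, vlm_judge):
    wins_a = 0
    for instr, img in zip(instructions, images):
        out_a = model_a.edit(img, instr)
        out_b = model_b.edit(img, instr)
        # The judge picks whichever output better follows the instruction
        # while preserving unrelated image content.
        if vlm_judge(img, instr, out_a, out_b) == "a":
            wins_a += 1
    return wins_a / len(instructions)  # win rate of model_a over model_b

# A win rate for Step1X-Edit far below parity against GPT-4o on fresh instructions
# would undercut the "approaches proprietary quality" claim.
```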

read the original abstract

In recent years, image editing models have witnessed remarkable and rapid development. The recent unveiling of cutting-edge multimodal models such as GPT-4o and Gemini2 Flash has introduced highly promising image editing capabilities. These models demonstrate an impressive aptitude for fulfilling a vast majority of user-driven editing requirements, marking a significant advancement in the field of image manipulation. However, there is still a large gap between the open-source algorithm with these closed-source models. Thus, in this paper, we aim to release a state-of-the-art image editing model, called Step1X-Edit, which can provide comparable performance against the closed-source models like GPT-4o and Gemini2 Flash. More specifically, we adopt the Multimodal LLM to process the reference image and the user's editing instruction. A latent embedding has been extracted and integrated with a diffusion image decoder to obtain the target image. To train the model, we build a data generation pipeline to produce a high-quality dataset. For evaluation, we develop the GEdit-Bench, a novel benchmark rooted in real-world user instructions. Experimental results on GEdit-Bench demonstrate that Step1X-Edit outperforms existing open-source baselines by a substantial margin and approaches the performance of leading proprietary models, thereby making significant contributions to the field of image editing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Step1X-Edit, an image editing framework that processes a reference image and user instruction via a Multimodal LLM, extracts a latent embedding, and combines it with a diffusion decoder to produce the edited output. It describes a custom data generation pipeline for creating training data and introduces GEdit-Bench, a benchmark derived from real-world user instructions. The central claim is that Step1X-Edit substantially outperforms existing open-source baselines on GEdit-Bench while approaching the performance of proprietary models such as GPT-4o and Gemini 2.0 Flash.

Significance. If the performance claims hold with proper validation, the work would be significant by providing an open-source image editing model that narrows the gap with leading closed-source systems and by releasing GEdit-Bench as a new resource grounded in practical editing needs.

major comments (3)
  1. [Abstract] The assertion that 'experimental results on GEdit-Bench demonstrate that Step1X-Edit outperforms existing open-source baselines by a substantial margin' is load-bearing for the central claim, yet it is unsupported by any quantitative metrics, tables, scores, or figures, which prevents evaluation of the reported margin.
  2. [Data generation pipeline] The pipeline is presented as producing 'high-quality' and 'diverse' examples that enable generalization to real-world instructions, yet no validation metrics (human preference scores, diversity statistics, or cross-benchmark transfer results) are supplied; this assumption directly determines whether the reported outperformance reflects architectural merit or a match to the training distribution.
  3. [Evaluation] No training hyperparameters, details of how the latent embedding is integrated with the diffusion decoder, ablation studies, or error analysis are reported, leaving the contribution of the MLLM-latent-diffusion design unassessable and the reproducibility of the GEdit-Bench results unclear.
minor comments (2)
  1. [Abstract] The model name 'Gemini2 Flash' should be standardized to the official nomenclature (e.g., Gemini 2.0 Flash) for precision.
  2. [Abstract] The abstract states there is 'still a large gap' between open-source and closed-source models but does not quantify this gap or cite specific prior open-source baselines, which would improve context.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below and will revise the manuscript to strengthen the presentation of results, validation, and reproducibility.

read point-by-point responses
  1. Referee: [Abstract] The assertion that 'experimental results on GEdit-Bench demonstrate that Step1X-Edit outperforms existing open-source baselines by a substantial margin' is load-bearing for the central claim, yet it is unsupported by any quantitative metrics, tables, scores, or figures, which prevents evaluation of the reported margin.

    Authors: We agree that the abstract would be strengthened by including concrete quantitative support. In the revised version we will add specific metrics (e.g., average GEdit-Bench scores or win rates versus open-source baselines and proprietary models) drawn from the experimental tables already present in the main body. revision: yes

  2. Referee: [Data generation pipeline] The pipeline is presented as producing 'high-quality' and 'diverse' examples that enable generalization to real-world instructions, yet no validation metrics (human preference scores, diversity statistics, or cross-benchmark transfer results) are supplied; this assumption directly determines whether the reported outperformance reflects architectural merit or a match to the training distribution.

    Authors: The manuscript describes the pipeline construction in detail, but we acknowledge the absence of explicit validation statistics. We will add a short subsection reporting diversity statistics (instruction-type distribution and semantic coverage) and human preference scores on a held-out sample of generated pairs. Cross-benchmark transfer results will be included if they can be computed without additional experiments. revision: partial

  3. Referee: [Evaluation] No training hyperparameters, details of how the latent embedding is integrated with the diffusion decoder, ablation studies, or error analysis are reported, leaving the contribution of the MLLM-latent-diffusion design unassessable and the reproducibility of the GEdit-Bench results unclear.

    Authors: We apologize for the omission of these details in the main text. Key training hyperparameters and the precise integration mechanism between the MLLM latent embedding and the diffusion decoder will be moved from the appendix into the Evaluation section. We will also add ablation studies isolating the MLLM and diffusion components together with a concise error analysis of failure cases on GEdit-Bench to improve both interpretability and reproducibility. revision: yes

Circularity Check

0 steps flagged

No circularity in model description or evaluation claims

full rationale

The paper presents a practical image-editing framework that combines an MLLM for instruction processing with a latent diffusion decoder. Training relies on a separately constructed data-generation pipeline and evaluation uses the independently developed GEdit-Bench benchmark. No mathematical derivation, first-principles prediction, or self-referential fitting step is claimed or exhibited; performance margins are reported as experimental outcomes on the benchmark rather than quantities forced by construction from the training data or model architecture. No load-bearing self-citation, ansatz smuggling, or renaming of known results appears in the abstract or the described sections.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on standard assumptions from multimodal learning and diffusion modeling; no new entities are postulated and free parameters are the usual training hyperparameters left unspecified in the abstract.

free parameters (1)
  • training hyperparameters
    Standard model size, learning rate, and optimization choices required for any such system but not detailed in the abstract.
axioms (2)
  • domain assumption Multimodal LLMs can extract useful editing instructions from paired image-text inputs
    Invoked when the abstract states the LLM processes the reference image and editing instruction to produce a latent embedding.
  • domain assumption Diffusion decoders can faithfully realize edits from latent embeddings
    Assumed in the integration step that produces the target image.

pith-pipeline@v0.9.0 · 5606 in / 1326 out tokens · 81143 ms · 2026-05-11T14:31:05.204395+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem:

    we adopt the Multimodal LLM to process the reference image and the user’s editing instruction. A latent embedding has been extracted and integrated with a diffusion image decoder to obtain the target image

  • Foundation.HierarchyEmergence hierarchy_emergence_forces_phi · unclear

    Relation between the paper passage and the cited Recognition theorem:

    we build a data generation pipeline covering 11 editing tasks to produce a high-quality dataset

  • Foundation.DimensionForcing dimension_forced · unclear

    Relation between the paper passage and the cited Recognition theorem:

    Experimental results on GEdit-Bench demonstrate that Step1X-Edit outperforms existing open-source baselines by a substantial margin

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 54 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Masked Generative Transformer Is What You Need for Image Editing

    cs.CV 2026-05 unverdicted novelty 8.0

    EditMGT applies masked generative transformers with attention consolidation and region-hold sampling to deliver state-of-the-art localized image editing at 6x the speed of diffusion methods.

  2. From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing

    cs.CV 2026-05 unverdicted novelty 7.0

    A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.

  3. Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling

    cs.CV 2026-05 unverdicted novelty 7.0

    Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.

  4. Inline Critic Steers Image Editing

    cs.CV 2026-05 conditional novelty 7.0

    Inline Critic uses a learnable token to critique and steer a frozen image-editing model's intermediate layers during generation, delivering state-of-the-art results on GEdit-Bench, RISEBench, and KRIS-Bench.

  5. G$^2$TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models

    cs.CV 2026-05 unverdicted novelty 7.0

    G²TR reduces visual tokens and prefill computation by 1.94x in separate-encoder UMMs via generation-guided importance from VAE latent consistency while preserving reasoning accuracy and editing quality.

  6. UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    UniCustom fuses ViT and VAE features before VLM encoding and uses two-stage training plus slot-wise regularization to improve subject consistency in multi-reference diffusion-based image generation.

  7. Generate "Normal", Edit Poisoned: Branding Injection via Hint Embedding in Image Editing

    cs.CR 2026-05 unverdicted novelty 7.0

    Invisible hints such as logos embedded in images are re-rendered by diffusion models during text-guided editing, enabling phishing and model-poisoning attacks with average success rates of 44.4% and 32.2%.

  8. EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement

    cs.CV 2026-05 unverdicted novelty 7.0

    EditRefiner uses a perception-reasoning-action-evaluation agent loop and the EditFHF-15K human feedback dataset to refine text-guided image edits more accurately than prior methods.

  9. FlowDIS: Language-Guided Dichotomous Image Segmentation with Flow Matching

    cs.CV 2026-05 unverdicted novelty 7.0

    FlowDIS uses flow matching to transport image distributions to mask distributions, optionally conditioned on text, and outperforms prior DIS methods by 5.5% on F_beta^omega and 43% on MAE.

  10. MULTITEXTEDIT: Benchmarking Cross-Lingual Degradation in Text-in-Image Editing

    cs.CV 2026-05 unverdicted novelty 7.0

    MULTITEXTEDIT benchmark reveals that all tested text-in-image editing models show pronounced degradation on non-English languages, especially Hebrew and Arabic, mainly in text accuracy and script fidelity.

  11. DirectEdit: Step-Level Accurate Inversion for Flow-Based Image Editing

    cs.CV 2026-05 unverdicted novelty 7.0

    DirectEdit achieves step-level accurate inversion for flow-based image editing by directly aligning forward paths, using attention feature injection and mask-guided noise blending to balance fidelity and editability w...

  12. SpecEdit: Training-Free Acceleration for Diffusion based Image Editing via Semantic Locking

    cs.CV 2026-05 unverdicted novelty 7.0

    SpecEdit accelerates diffusion-based image editing up to 10x by using a low-resolution draft to identify edit-relevant tokens via semantic discrepancies for selective high-resolution denoising.

  13. Edit Where You Mean: Region-Aware Adapter Injection for Mask-Free Local Image Editing

    cs.CV 2026-04 unverdicted novelty 7.0

    A co-trained adapter framework enables mask-free local editing in DiTs by factorizing edit semantics from spatial location and jointly learning a mask predictor.

  14. HP-Edit: A Human-Preference Post-Training Framework for Image Editing

    cs.CV 2026-04 unverdicted novelty 7.0

    HP-Edit introduces a post-training framework and RealPref-50K dataset that uses a VLM-based HP-Scorer to align diffusion image editing models with human preferences, improving outputs on Qwen-Image-Edit-2509.

  15. UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

    cs.CV 2026-04 unverdicted novelty 7.0

    UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.

  16. UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs

    cs.CV 2026-04 unverdicted novelty 7.0

    UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.

  17. AIM-Bench: Benchmarking and Improving Affective Image Manipulation via Fine-Grained Hierarchical Control

    cs.CV 2026-04 unverdicted novelty 7.0

    AIM-Bench is the first dedicated benchmark for editing images to evoke specific emotions with fine-grained control, paired with AIM-40k dataset that delivers a 9.15% performance gain by correcting training data imbalances.

  18. RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details

    cs.CV 2026-04 unverdicted novelty 7.0

    RefineAnything is a multimodal diffusion model using Focus-and-Refine crop-and-resize with blended paste-back to achieve high-fidelity local image refinement and near-perfect background preservation.

  19. Dress-ED: Instruction-Guided Editing for Virtual Try-On and Try-Off

    cs.CV 2026-03 unverdicted novelty 7.0

    Dress-ED is the first large-scale benchmark unifying virtual try-on, try-off, and text-guided garment editing with 146k verified samples plus a multimodal diffusion baseline.

  20. Early Semantic Grounding in Image Editing Models for Zero-Shot Referring Image Segmentation

    cs.CV 2026-05 unverdicted novelty 6.0

    Pretrained instruction-based image editing models exhibit early foreground-background separability that enables a training-free framework for zero-shot referring image segmentation using a single denoising step.

  21. UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    A unified visual conditioning approach fuses semantic and appearance features before VLM processing, with two-stage training and slot-wise regularization, to improve consistency in multi-reference image generation.

  22. GeoR-Bench: Evaluating Geoscience Visual Reasoning

    cs.CV 2026-05 unverdicted novelty 6.0

    GeoR-Bench shows top multimodal models reach only 42.7% strict accuracy on geoscience visual reasoning tasks while open-source models reach 10.3%, with outputs often visually plausible yet scientifically inaccurate.

  23. HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer

    cs.CV 2026-05 unverdicted novelty 6.0

    A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B t...

  24. Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria

    cs.AI 2026-05 unverdicted novelty 6.0

    Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text...

  25. FlowDIS: Language-Guided Dichotomous Image Segmentation with Flow Matching

    cs.CV 2026-05 unverdicted novelty 6.0

    FlowDIS uses flow matching to transport image distributions to mask distributions with language guidance and PAIP training, outperforming prior DIS methods by 5.5% on F_beta^omega and 43% on MAE on DIS-TE.

  26. MooD: Perception-Enhanced Efficient Affective Image Editing via Continuous Valence-Arousal Modeling

    cs.CV 2026-05 unverdicted novelty 6.0

    MooD introduces continuous valence-arousal modeling with VA-aware retrieval and perception-enhanced guidance for efficient, controllable affective image editing, plus a new AffectSet dataset.

  27. MooD: Perception-Enhanced Efficient Affective Image Editing via Continuous Valence-Arousal Modeling

    cs.CV 2026-05 unverdicted novelty 6.0

    MooD is the first framework to use continuous Valence-Arousal values for fine-grained affective image editing via a VA-aware retrieval strategy, visual transfer, semantic guidance, and the new AffectSet dataset.

  28. Leveraging Verifier-Based Reinforcement Learning in Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    Edit-R1 trains a CoT-based reasoning reward model with GCPO and uses it to boost image editing performance over VLMs and models like FLUX.1-kontext via GRPO.

  29. SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness

    cs.CV 2026-04 unverdicted novelty 6.0

    SpatialFusion internalizes 3D geometric awareness into unified image generation models by pairing an MLLM with a spatial transformer that produces depth maps to constrain diffusion generation.

  30. DDA-Thinker: Decoupled Dual-Atomic Reinforcement Learning for Reasoning-Driven Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    DDA-Thinker decouples planning from generation and applies dual-atomic RL with checklist-based rewards to boost reasoning in image editing, yielding competitive results on RISE-Bench and KRIS-Bench.

  31. Meta-CoT: Enhancing Granularity and Generalization in Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    Meta-CoT uses two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.

  32. MeshLAM: Feed-Forward One-Shot Animatable Textured Mesh Avatar Reconstruction

    cs.CV 2026-04 unverdicted novelty 6.0

    MeshLAM reconstructs high-fidelity animatable textured mesh head avatars from a single image via a feed-forward dual shape-texture architecture with iterative GRU decoding and reprojection-based guidance.

  33. LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

    cs.CV 2026-04 unverdicted novelty 6.0

    LLaDA2.0-Uni unifies multimodal understanding and generation inside one discrete diffusion large language model with a semantic tokenizer, MoE backbone, and diffusion decoder.

  34. Rethinking Where to Edit: Task-Aware Localization for Instruction-Based Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    Task-aware localization via attention cues and feature centroids from source/target streams in IIE models improves non-edit consistency while preserving instruction following.

  35. UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

    cs.CV 2026-04 unverdicted novelty 6.0

    UniGeo adds unified geometric guidance at three levels in video models to reduce geometric drift and improve structural fidelity in camera-controllable image editing.

  36. LIVE: Leveraging Image Manipulation Priors for Instruction-based Video Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    LIVE achieves state-of-the-art instruction-based video editing by jointly training on image and video data with a frame-wise token noise strategy to bridge domain gaps and a new benchmark of over 60 tasks.

  37. InsEdit: Towards Instruction-based Visual Editing via Data-Efficient Video Diffusion Models Adaptation

    cs.CV 2026-04 unverdicted novelty 6.0

    InsEdit adapts a video diffusion backbone for text-instruction video editing via Mutual Context Attention, achieving SOTA open-source results with O(100K) data while also supporting image editing.

  38. SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    SpatialEdit provides a benchmark, large synthetic dataset, and baseline model for precise object and camera spatial manipulations in images, with the model beating priors on spatial editing.

  39. FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

    cs.GR 2025-06 unverdicted novelty 6.0

    FLUX.1 Kontext unifies image generation and editing via flow matching and sequence concatenation, delivering improved multi-turn consistency and speed on the new KontextBench benchmark.

  40. ImgEdit: A Unified Image Editing Dataset and Benchmark

    cs.CV 2025-05 conditional novelty 6.0

    ImgEdit supplies 1.2 million curated edit pairs and a three-part benchmark that let a VLM-based model outperform prior open-source editors on adherence, quality, and detail preservation.

  41. SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

    cs.CV 2026-05 unverdicted novelty 5.0

    SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.

  42. Steering Visual Generation in Unified Multimodal Models with Understanding Supervision

    cs.CV 2026-05 unverdicted novelty 5.0

    Using understanding tasks as direct supervision during post-training improves image generation and editing in unified multimodal models.

  43. DataEvolver: Let Your Data Build and Improve Itself via Goal-Driven Loop Agents

    cs.AI 2026-05 unverdicted novelty 5.0

    DataEvolver introduces a reusable framework with generation-time self-correction and validation-time self-expansion loops that improves visual datasets, shown to outperform baselines on an object-rotation task.

  44. Context Unrolling in Omni Models

    cs.CV 2026-04 unverdicted novelty 5.0

    Omni is a multimodal model whose native training on diverse data types enables context unrolling, allowing explicit reasoning across modalities to better approximate shared knowledge and improve downstream performance.

  45. SmartPhotoCrafter: Unified Reasoning, Generation and Optimization for Automatic Photographic Image Editing

    cs.CV 2026-04 unverdicted novelty 5.0

    SmartPhotoCrafter performs automatic photographic image editing by coupling an Image Critic module that identifies deficiencies with a Photographic Artist module that generates edits, trained via multi-stage pretraini...

  46. FineEdit: Fine-Grained Image Edit with Bounding Box Guidance

    cs.CV 2026-04 unverdicted novelty 5.0

    FineEdit adds multi-level bounding box injection to diffusion image editing, releases a 1.2M-pair dataset with box annotations, and shows better instruction following and background consistency than prior open models ...

  47. Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

    cs.CV 2025-11 unverdicted novelty 5.0

    Z-Image is an efficient 6B-parameter foundation model for image generation that rivals larger commercial systems in photorealism and bilingual text rendering through a new single-stream diffusion transformer and strea...

  48. Qwen-Image Technical Report

    cs.CV 2025-08 unverdicted novelty 5.0

    Qwen-Image is a foundation model that reaches state-of-the-art results in image generation and editing by combining a large-scale text-focused data pipeline with curriculum learning and dual semantic-reconstructive en...

  49. UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

    cs.CV 2025-06 unverdicted novelty 5.0

    UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.

  50. Emerging Properties in Unified Multimodal Pretraining

    cs.CV 2025-05 unverdicted novelty 5.0

    BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.

  51. simpleposter: a simple baseline for product poster generation

    cs.CV 2026-05 unverdicted novelty 4.0

    SimplePoster achieves 98.7% subject preservation and improved text accuracy in product posters via full-parameter fine-tuning of an inpainting model and zero-cost character-level position encoding, outperforming compl...

  52. Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

    cs.GR 2026-05 unverdicted novelty 4.0

    JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.

  53. Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE

    cs.CV 2026-05 unverdicted novelty 4.0

    Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.

  54. TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training

    cs.AI 2026-04 unverdicted novelty 4.0

    TorchUMM is the first unified codebase and benchmark suite for standardized evaluation of diverse unified multimodal models on understanding, generation, and editing tasks.

Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · cited by 50 Pith papers · 4 internal anchors

  1. [1]

    Stable diffusion 3.5

    Stability AI. Stable diffusion 3.5. https://huggingface.co/stabilityai/stable-diffusion-3.5-large, 2024. Accessed: 2025-04-17

  2. [2]

    Humanedit: A high-quality human-rewarded dataset for instruction-based image editing

    Jinbin Bai, Wei Chow, Ling Yang, Xiangtai Li, Juncheng Li, Hanwang Zhang, and Shuicheng Yan. Humanedit: A high-quality human-rewarded dataset for instruction-based image editing. arXiv preprint arXiv:2412.04280, 2024

  3. [3]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025

  4. [4]

    ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

    Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288, 2023

  5. [5]

    Flux.1 [dev]

    Black Forest Labs. Flux.1 [dev]. https://huggingface.co/black-forest-labs/FLUX.1-dev, 2024

  6. [6]

    Flux.1 fill [dev]

    Black Forest Labs. Flux.1 fill [dev]. https://huggingface.co/black-forest-labs/FLUX.1-Fill-dev, 2024. Accessed: 2025-04-19

  7. [7]

    Flux.1 [schnell]

    Black Forest Labs. Flux.1 [schnell]. https://huggingface.co/black-forest-labs/FLUX.1-schnell, 2024

  8. [8]

    Instructpix2pix: Learning to follow image editing instructions

    Tim Brooks, Aleksander Holynski, and Alexei A. Efros. Instructpix2pix: Learning to follow image editing instructions. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18392–18402, 2022

  9. [9]

    Multimodal representation alignment for image generation: Text-image interleaved control is easier than you think, 2025

    Liang Chen, Shuai Bai, Wenhao Chai, Weichu Xie, Haozhe Zhao, Leon Vinci, Junyang Lin, and Baobao Chang. Multimodal representation alignment for image generation: Text-image interleaved control is easier than you think, 2025

  10. [10]

    Pp-ocr: A practical ultra lightweight ocr system

    Yuning Du, Chenxia Li, Ruoyu Guo, Xiaoting Yin, Weiwei Liu, Jun Zhou, Yifan Bai, Zilin Yu, Yehua Yang, Qingqing Dang, and Haoshuang Wang. Pp-ocr: A practical ultra lightweight ocr system. arXiv preprint arXiv:2009.09941, 2020

  11. [11]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning, 2024

  12. [12]

    Unified autoregressive visual generation and understanding with continuous tokens

    Lijie Fan, Luming Tang, Siyang Qin, Tianhong Li, Xuan Yang, Siyuan Qiao, Andreas Steiner, Chen Sun, Yuanzhen Li, Tao Zhu, et al. Unified autoregressive visual generation and understanding with continuous tokens. arXiv preprint arXiv:2503.13436, 2025

  13. [13]

    Got: Unleashing reasoning capability of multimodal large language model for visual generation and editing

    Rongyao Fang, Chengqi Duan, Kun Wang, Linjiang Huang, Hao Li, Shilin Yan, Hao Tian, Xingyu Zeng, Rui Zhao, Jifeng Dai, Xihui Liu, and Hongsheng Li. Got: Unleashing reasoning capability of multimodal large language model for visual generation and editing. arXiv preprint arXiv:2503.10639, 2025

  14. [14]

    Seed-data-edit technical report: A hybrid dataset for instructional image editing

    Yuying Ge, Sijie Zhao, Chen Li, Yixiao Ge, and Ying Shan. Seed-data-edit technical report: A hybrid dataset for instructional image editing. arXiv preprint arXiv:2405.04007, 2024

  15. [15]

    Experiment with gemini 2.0 flash native image generation, 2025

    Google Gemini2. Experiment with gemini 2.0 flash native image generation, 2025

  16. [16]

    Ace: All-round creator and editor following instructions via diffusion transformer

    Zhen Han, Zeyinzi Jiang, Yulin Pan, Jingfeng Zhang, Chaojie Mao, Chenwei Xie, Yu Liu, and Jingren Zhou. Ace: All-round creator and editor following instructions via diffusion transformer. arXiv preprint arXiv:2410.00086, 2024

  17. [17]

    Hidream-e1

    HiDream-ai. Hidream-e1. https://github.com/HiDream-ai/HiDream-E1, 2025

  18. [18]

    Hidream-i1

    HiDream-ai. Hidream-i1. https://github.com/HiDream-ai/HiDream-I1, 2025

  19. [19]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

  20. [20]

    Instruct-imagen: Image generation with multi-modal instruction

    Hexiang Hu, Kelvin CK Chan, Yu-Chuan Su, Wenhu Chen, Yandong Li, Kihyuk Sohn, Yang Zhao, Xue Ben, Boqing Gong, William Cohen, et al. Instruct-imagen: Image generation with multi-modal instruction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4754–4763, 2024

  21. [21]

    Smartedit: Exploring complex instruction-based image editing with multimodal large language models

    Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, and Ying Shan. Smartedit: Exploring complex instruction-based image editing with multimodal large language models. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8362–8371, 2024

  22. [22]

    Hq-edit: A high-quality dataset for instruction-based image editing

    Mude Hui, Siwei Yang, Bingchen Zhao, Yichun Shi, Heng Wang, Peng Wang, Yuyin Zhou, and Cihang Xie. Hq-edit: A high-quality dataset for instruction-based image editing. arXiv preprint arXiv:2404.09990, 2024

  23. [23]

    Brushnet: A plug-and-play image inpainting model with decomposed dual-branch diffusion

    Xuan Ju, Xian Liu, Xintao Wang, Yuxuan Bian, Ying Shan, and Qiang Xu. Brushnet: A plug-and-play image inpainting model with decomposed dual-branch diffusion. In European Conference on Computer Vision, pages 150–168. Springer, 2024

  24. [24]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

  25. [25]

    Viescore: Towards explainable metrics for conditional image synthesis evaluation

    Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, and Wenhu Chen. Viescore: Towards explainable metrics for conditional image synthesis evaluation. arXiv preprint arXiv:2312.14867, 2023

  26. [26]

    Flux

    Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024

  27. [27]

    Controlvar: Exploring controllable visual autoregressive modeling

    Xiang Li, Kai Qiu, Hao Chen, Jason Kuen, Zhe Lin, Rita Singh, and Bhiksha Raj. Controlvar: Exploring controllable visual autoregressive modeling. arXiv preprint arXiv:2406.09750, 2024

  28. [28]

    Brushedit: All-in-one image inpainting and editing

    Yaowei Li, Yuxuan Bian, Xu Ju, Zhaoyang Zhang, Ying Shan, and Qiang Xu. Brushedit: All-in-one image inpainting and editing. ArXiv, abs/2412.10316, 2024

  29. [29]

    Controlar: Controllable image generation with autoregressive models

    Zongming Li, Tianheng Cheng, Shoufa Chen, Peize Sun, Haocheng Shen, Longjin Ran, Xiaoxin Chen, Wenyu Liu, and Xinggang Wang. Controlar: Controllable image generation with autoregressive models. In International Conference on Learning Representations, 2025

  30. [30]

    Objectremovalalpha dataset

    lrzjason. Objectremovalalpha dataset. https://huggingface.co/datasets/lrzjason/ObjectRemovalAlpha, 2025. Accessed: 2025-04-19

  31. [31]

    Qwen2vl-flux: Unifying image and text guidance for controllable image generation, 2024

    Pengqi Lu. Qwen2vl-flux: Unifying image and text guidance for controllable image generation, 2024

  32. [32]

    Exploring the role of large language models in prompt encoding for diffusion models

    Bingqi Ma, Zhuofan Zong, Guanglu Song, Hongsheng Li, and Yu Liu. Exploring the role of large language models in prompt encoding for diffusion models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  33. [33]

    X2i: Seamless integration of multimodal understanding into diffusion transformer via attention distillation

    Jiancang Ma, Qirong Peng, Xu Guo, Chen Chen, H. Lu, and Zhenyu Yang. X2i: Seamless integration of multimodal understanding into diffusion transformer via attention distillation. ArXiv, abs/2503.06134, 2025

  34. [34]

    Ace++: Instruction-based image creation and editing via context-aware content filling

    Chaojie Mao, Jingfeng Zhang, Yulin Pan, Zeyinzi Jiang, Zhen Han, Yu Liu, and Jingren Zhou. Ace++: Instruction-based image creation and editing via context-aware content filling. arXiv preprint arXiv:2501.02487, 2025

  35. [35]

    Superedit: Rectifying and facilitating supervision for instruction-based image editing

    Li Ming, Gu Xin, Chen Fan, Xing Xiaoying, Wen Longyin, Chen Chen, and Zhu Sijie. Superedit: Rectifying and facilitating supervision for instruction-based image editing. arXiv preprint arXiv:2505.02370, 2025

  36. [36]

    T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models

    Chong Mou, Xintao Wang, Liangbin Xie, Jing Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. ArXiv, abs/2302.08453, 2023

  37. [37]

    Introducing 4o image generation, 2025

    OpenAI. Introducing 4o image generation, 2025

  38. [38]

    Flex.2-preview

    ostris. Flex.2-preview. https://huggingface.co/ostris/Flex.2-preview, 2025

  39. [39]

    Transfer between Modalities with MetaQueries

    Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, et al. Transfer between modalities with metaqueries. arXiv preprint arXiv:2504.06256, 2025

  40. [40]

    Ice-bench: A unified and comprehensive benchmark for image creating and editing

    Yulin Pan, Xiangteng He, Chaojie Mao, Zhen Han, Zeyinzi Jiang, Jingfeng Zhang, and Yu Liu. Ice-bench: A unified and comprehensive benchmark for image creating and editing. arXiv preprint arXiv:2503.14482, 2025

  41. [41]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  42. [42]

    SDXL: Improving latent diffusion models for high-resolution image synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. In The Twelfth International Conference on Learning Representations, 2024

  43. [43]

    Lumina-omnilv: A unified multimodal framework for general low-level vision

    Yuandong Pu, Le Zhuo, Kaiwen Zhu, Liangbin Xie, Wenlong Zhang, Xiangyu Chen, Pneg Gao, Yu Qiao, Chao Dong, and Yihao Liu. Lumina-omnilv: A unified multimodal framework for general low-level vision. arXiv preprint arXiv:2504.04903, 2025

  44. [44]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PmLR, 2021

  45. [45]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020

  46. [46]

    SAM 2: Segment anything in images and videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollar, and Christoph Feichtenhofer. SAM 2: Segment anything in images and videos. In The Thirteenth Internatio...

  47. [47]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685, 2021

  48. [48]

    Many-to-many image generation with auto-regressive diffusion models

    Ying Shen, Yizhe Zhang, Shuangfei Zhai, Lifu Huang, Joshua M Susskind, and Jiatao Gu. Many-to-many image generation with auto-regressive diffusion models. arXiv preprint arXiv:2404.03109, 2024

  49. [49]

    Emu edit: Precise image editing via recognition and generation tasks

    Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu edit: Precise image editing via recognition and generation tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8871–8879, 2024

  50. [50]

    Seededit: Align image re-generation to image editing

    Yichun Shi, Peng Wang, and Weilin Huang. Seededit: Align image re-generation to image editing. arXiv preprint arXiv:2411.06686, 2024

  51. [51]

    Denoising diffusion implicit models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. International Conference on Learning Representations (ICLR), 2021

  52. [52]

    step-1o-turbo-vision

    StepFun. step-1o-turbo-vision. https://platform.stepfun.com/, 2025

  53. [53]

    Ominicontrol: Minimal and universal control for diffusion transformer

    Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, and Xinchao Wang. Ominicontrol: Minimal and universal control for diffusion transformer. arXiv preprint arXiv:2411.15098, 2024

  54. [54]

    Raft: Recurrent all-pairs field transforms for optical flow

    Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 402–419. Springer, 2020

  55. [55]

    Instructedit: Improving automatic masks for diffusion-based image editing with user instructions

    Qian Wang, Biao Zhang, Michael Birsak, and Peter Wonka. Instructedit: Improving automatic masks for diffusion-based image editing with user instructions. ArXiv, abs/2305.18047, 2023

  56. [56]

    Koala-36m: A large-scale video dataset improving consistency between fine-grained conditions and video content, 2024

    Qiuheng Wang, Yukai Shi, Jiarong Ou, Rui Chen, Ke Lin, Jiahao Wang, Boyuan Jiang, Haotian Yang, Mingwu Zheng, Xin Tao, Fei Yang, Pengfei Wan, and Di Zhang. Koala-36m: A large-scale video dataset improving consistency between fine-grained conditions and video content, 2024

  57. [57]

    Imagen editor and editbench: Advancing and evaluating text-guided image inpainting

    Su Wang, Chitwan Saharia, Ceslee Montgomery, Jordi Pont-Tuset, Shai Noy, Stefano Pellegrini, Yasumasa Onoe, Sarah Laszlo, David J Fleet, Radu Soricut, et al. Imagen editor and editbench: Advancing and evaluating text-guided image inpainting. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18359–18369, 2023

  58. [58]

    Training-free text-guided image editing with visual autoregressive model

    Yufei Wang, Lanqing Guo, Zhihao Li, Jiaxing Huang, Pichao Wang, Bihan Wen, and Jian Wang. Training- free text-guided image editing with visual autoregressive model. arXiv preprint arXiv:2503.23897, 2025

  59. [59]

    Omniedit: Building image editing generalist models through specialist supervision

    Cong Wei, Zheyang Xiong, Weiming Ren, Xinrun Du, Ge Zhang, and Wenhu Chen. Omniedit: Building image editing generalist models through specialist supervision. arXiv preprint arXiv:2411.07199, 2024

  60. [60]

    Florence-2: Advancing a unified representation for a variety of vision tasks

    Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, and Lu Yuan. Florence-2: Advancing a unified representation for a variety of vision tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4818–4829, 2024

  61. [61]

    Omnigen: Unified image generation

    Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. Omnigen: Unified image generation. arXiv preprint arXiv:2409.11340, 2024

  62. [62]

    Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal llms

    Ling Yang, Zhaochen Yu, Chenlin Meng, Minkai Xu, Stefano Ermon, and Bin Cui. Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal llms. ArXiv, abs/2401.11708, 2024

  63. [63]

    Car: Controllable autoregressive modeling for visual generation

    Ziyu Yao, Jialin Li, Yifeng Zhou, Yong Liu, Xi Jiang, Chengjie Wang, Feng Zheng, Yuexian Zou, and Lei Li. Car: Controllable autoregressive modeling for visual generation. arXiv preprint arXiv:2410.04671, 2024

  64. [64]

    Anyedit: Mastering unified high-quality image editing for any idea

    Qifan Yu, Wei Chow, Zhongqi Yue, Kaihang Pan, Yang Wu, Xiaoyang Wan, Juncheng Li, Siliang Tang, Hanwang Zhang, and Yueting Zhuang. Anyedit: Mastering unified high-quality image editing for any idea. arXiv preprint arXiv:2411.15738, 2024

  65. [65]

    Promptfix: You prompt and we fix the photo

    Yongsheng Yu, Ziyun Zeng, Hang Hua, Jianlong Fu, and Jiebo Luo. Promptfix: You prompt and we fix the photo. arXiv preprint arXiv:2405.16785, 2024

  66. [66]

    Magicbrush: A manually annotated dataset for instruction-guided image editing

    Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. Magicbrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Processing Systems, 36:31428–31449, 2023

  67. [67]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 3813–3824, 2023

  68. [68]

    Hive: Harnessing human feedback for instructional visual editing

    Shu Zhang, Xinyi Yang, Yihao Feng, Can Qin, Chia-Chih Chen, Ning Yu, Zeyuan Chen, Huan Wang, Silvio Savarese, Stefano Ermon, et al. Hive: Harnessing human feedback for instructional visual editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9026–9036, 2024

  69. [69]

    Recognize anything: A strong image tagging model

    Youcai Zhang, Xinyu Huang, Jinyu Ma, Zhaoyang Li, Zhaochuan Luo, Yanchun Xie, Yuzhuo Qin, Tong Luo, Yaqian Li, Shilong Liu, et al. Recognize anything: A strong image tagging model. arXiv preprint arXiv:2306.03514, 2023

  70. [70]

    In-context edit: Enabling instructional image editing with in-context generation in large scale diffusion transformer

    Zechuan Zhang, Ji Xie, Yu Lu, Zongxin Yang, and Yi Yang. In-context edit: Enabling instructional image editing with in-context generation in large scale diffusion transformer. arXiv, 2025

  71. [71]

    Ultraedit: Instruction-based fine-grained image editing at scale

    Haozhe Zhao, Xiaojian Shawn Ma, Liang Chen, Shuzheng Si, Rujie Wu, Kaikai An, Peiyu Yu, Minjia Zhang, Qing Li, and Baobao Chang. Ultraedit: Instruction-based fine-grained image editing at scale. Advances in Neural Information Processing Systems, 37:3058–3093, 2024

  72. [72]

    Bilateral reference for high-resolution dichotomous image segmentation

    Peng Zheng, Dehong Gao, Deng-Ping Fan, Li Liu, Jorma Laaksonen, Wanli Ouyang, and Nicu Sebe. Bilateral reference for high-resolution dichotomous image segmentation. CAAI Artificial Intelligence Research, 3:9150038, 2024

  73. [73]

    A task is worth one word: Learning with task prompts for high-quality versatile image inpainting

    Junhao Zhuang, Yanhong Zeng, Wenran Liu, Chun Yuan, and Kai Chen. A task is worth one word: Learning with task prompts for high-quality versatile image inpainting. In European Conference on Computer Vision, pages 195–211. Springer, 2024

  74. [74]

    Señorita-2m: A high-quality instruction-based dataset for general video editing by video specialists

    Bojia Zi, Penghui Ruan, Marco Chen, Xianbiao Qi, Shaozhe Hao, Shihao Zhao, Youze Huang, Bin Liang, Rong Xiao, and Kam-Fai Wong. Señorita-2m: A high-quality instruction-based dataset for general video editing by video specialists. arXiv preprint arXiv:2502.06734, 2025