EditCaption: Human-Refined SFT and HAE-DPO for Image Editing Instruction Synthesis

Chao Hui; Haohua Chen; Hao Shi; Honghao Cai; Tianze Zhou; Wei Zhu; Xiangyuan Wang; Xu Tang; Yao Hu; Yibo Chen

arxiv: 2604.08213 · v2 · pith:UI644VF2new · submitted 2026-04-09 · 💻 cs.CV · cs.AI

Xiangyuan Wang, Honghao Cai, Yunhao Bai, Chao Hui, Tianze Zhou, Haohua Chen, Hao Shi, Yuling Wu, Yao Hu, Xu Tang, Yibo Chen, Wei Zhu

Pith reviewed 2026-05-10 17:18 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords image editing, instruction synthesis, vision-language models, supervised fine-tuning, direct preference optimization, human alignment, data generation

The pith

A two-stage SFT and DPO pipeline aligns vision-language models to cut critical errors in image editing instructions from 47% to 23%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to remove the data bottleneck for instruction-guided image editing by fixing how vision-language models write editing instructions from image pairs. It identifies three recurring failure modes—orientation mistakes, viewpoint confusion, and vague attribute details—that make over 47% of generated instructions unusable. The proposed EditCaption method first builds a 100K supervised fine-tuning set through automated annotation, filtering, and human fixes, then applies direct preference optimization on 10K human-ranked pairs that target those exact modes. If the approach holds, it supplies a scalable route to high-quality training triplets without requiring full manual captioning for every new editing task.

Core claim

The central claim is that supervised fine-tuning on a 100K human-refined dataset followed by direct preference optimization on 10K targeted preference pairs produces editing instructions whose quality exceeds that of strong baseline VLMs. The resulting fine-tuned Qwen3-VL models reach 4.712 on Eval-400 and 4.588 on ByteMorph-Bench, matching or surpassing Gemini-3-Pro and GPT-4.1, while human raters record critical errors dropping from 47.75% to 23% and correctness rising from 41.75% to 66%.

What carries the argument

EditCaption, a two-stage post-training pipeline that first performs supervised fine-tuning on GLM-annotated and human-refined image-pair instructions, then applies direct preference optimization on human preference pairs explicitly collected for orientation, viewpoint, and attribute errors.

If this is right

Fine-tuned Qwen3-VL models outperform open-source baselines on Eval-400, ByteMorph-Bench, and HQ-Edit.
The 235B model reaches benchmark scores comparable to or higher than Gemini-3-Pro and GPT-4.1.
Human evaluation shows critical errors roughly halved (47.75% to 23%) and instruction correctness up by more than 50% in relative terms (41.75% to 66%).
The pipeline offers a repeatable method to generate large volumes of human-aligned editing data without full manual annotation.
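
The "halved" and "more than 50%" figures in this list are relative changes rather than percentage-point differences. A minimal check under that reading, using the human-evaluation rates quoted in the core claim, might look like this (the helper name `relative_change` is illustrative and not taken from the paper):

```python
def relative_change(before: float, after: float) -> float:
    """Return (after - before) / before, i.e. the change as a fraction of the start value."""
    return (after - before) / before


# Rates quoted in the summary above, in percent.
critical_error_change = relative_change(47.75, 23.0)
correctness_change = relative_change(41.75, 66.0)

print(f"critical errors: {critical_error_change:+.1%}")  # critical errors: -51.8%
print(f"correct rate:    {correctness_change:+.1%}")     # correct rate:    +58.1%
```

In absolute terms both shifts are close to 24 percentage points; the relative framing is what yields "halved" and "more than 50%".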

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same diagnosis-and-target approach could reduce analogous spatial or attribute errors when synthesizing instructions for video or 3D editing tasks.
Collecting preference data only after identifying concrete failure modes may be more data-efficient than generic alignment methods for creative generation.
Widespread adoption would lower the barrier for open models to serve as reliable data generators for instruction-tuned editing systems.

Load-bearing premise

The three identified failure modes are the dominant sources of unusable instructions and the collected human preference data faithfully captures them without introducing new selection biases.

What would settle it

A controlled experiment that trains identical image-editing models on instructions produced by the fine-tuned VLM versus instructions from an unaligned baseline and measures downstream editing success rate or user preference scores.
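
One way to read out such an experiment is a two-proportion z-test on downstream editing success rates. The sketch below is only an illustration of that analysis step: the function name and the counts in the usage lines are hypothetical placeholders, and nothing in the paper indicates that the authors ran this test.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference between two success rates.

    Returns (z, p_value). Uses the pooled-variance normal approximation,
    which assumes reasonably large sample sizes in both arms.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both arms need at least one trial")
    rate_a = successes_a / n_a
    rate_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("pooled rate is 0 or 1; the test is undefined")
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: edits judged successful out of 100 trials per arm.
z, p = two_proportion_z_test(successes_a=80, n_a=100, successes_b=50, n_b=100)
print(f"z = {z:.2f}, p = {p:.2g}")
```

Arm A would be the editor trained on instructions from the fine-tuned VLM and arm B the editor trained on baseline instructions, with everything else held fixed.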

Figures

Figures reproduced from arXiv: 2604.08213 by Chao Hui, Haohua Chen, Hao Shi, Honghao Cai, Tianze Zhou, Wei Zhu, Xiangyuan Wang, Xu Tang, Yao Hu, Yibo Chen, Yuling Wu, Yunhao Bai.

**Figure 1.** Overview of our approach. (a) Task & Failure Modes: Given a source-target image pair, VLMs must generate accurate [...]

**Figure 2.** Overview of Data Distribution. Our training dataset is divided into three categories: Semantic Editing (content-based [...]

**Figure 3.** The three-step SFT data construction pipeline.

**Figure 4.** Qualitative examples of generated editing instructions. Given the same source-target image pairs, our model produces [...]

Original abstract

High-quality source-target image pairs with precise editing instructions are essential for instruction-guided image editing, yet constructing such training triplets at scale remains costly. Recent pipelines often rely on vision-language models to synthesize editing instructions automatically, but we find that strong VLMs still struggle to describe visual transformations between image pairs. In particular, they exhibit three recurring failure modes: orientation inconsistency, viewpoint ambiguity, and missing fine-grained attributes. In a human evaluation on 400 image pairs, several open-source VLM baselines produce critical-error rates above 47%, making many synthesized instructions unsuitable for downstream training. To address this, we propose EditCaption, a two-stage post-training pipeline for image editing instruction synthesis. First, we construct a 100K supervised fine-tuning dataset through GLM-based auto-captioning, EditScore filtering, and human refinement. Second, we collect 10K human-annotated preference pairs, where each rejected instruction is labeled with its primary error type and severity. Based on this dataset, we propose Hardness-Adaptive Error-Aware DPO (HAE-DPO), a task-adapted DPO objective that introduces an adaptive margin based on human-labeled severity, failure-mode type, and reference-model hardness. Experiments across three benchmarks demonstrate that our 235B model with SFT+HAE-DPO achieves state-of-the-art performance among open-source and closed models, scoring 4.720 on Eval-400, 4.672 on HQ-Edit, and 4.651 on ByteMorph-Bench -- surpassing Gemini-3-Pro on all three. Human evaluation confirms critical error rates drop from 47.75% to 17.50%, with correct rates improving from 41.75% to 70.25%, surpassing Gemini-3-Pro (66.00%).
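
The abstract names the ingredients of the HAE-DPO margin (severity, failure-mode type, reference-model hardness) but the text reproduced here does not give the formula. The sketch below therefore shows only the general shape of a DPO loss with a per-pair margin. The loss follows the standard DPO logistic form with a margin subtracted inside the sigmoid; `adaptive_margin`, its arguments, and the way it combines the three signals are assumptions made for illustration, not the authors' objective.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def adaptive_margin(severity, mode_weight, ref_chosen_logp, ref_rejected_logp, scale=1.0):
    """HYPOTHETICAL margin: grows with labeled severity, a per-failure-mode
    weight, and a 'hardness' term that is high when the reference model
    already prefers the rejected instruction."""
    hardness = math.exp(log_sigmoid(ref_rejected_logp - ref_chosen_logp))
    return scale * severity * mode_weight * (1.0 + hardness)


def dpo_loss_with_margin(policy_chosen_logp, policy_rejected_logp,
                         ref_chosen_logp, ref_rejected_logp,
                         beta=0.1, margin=0.0):
    """Standard DPO loss for one pair, with an extra margin inside the sigmoid.

    With margin=0 this reduces to plain DPO:
    -log sigmoid(beta * [(pi_w - ref_w) - (pi_l - ref_l)]).
    """
    policy_gap = policy_chosen_logp - policy_rejected_logp
    ref_gap = ref_chosen_logp - ref_rejected_logp
    return -log_sigmoid(beta * (policy_gap - ref_gap) - margin)


# Toy log-probabilities, chosen only to exercise the functions.
m = adaptive_margin(severity=1.0, mode_weight=0.5,
                    ref_chosen_logp=-12.0, ref_rejected_logp=-10.0)
loss = dpo_loss_with_margin(-9.0, -14.0, -12.0, -10.0, beta=0.1, margin=m)
print(f"margin = {m:.3f}, loss = {loss:.3f}")
```

A positive margin asks the policy to separate chosen from rejected instructions by more than plain DPO would before the loss flattens, which is one plausible reading of "adaptive margin" in the abstract.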

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

EditCaption gives a workable recipe for cutting VLM errors in image editing instructions via targeted SFT and DPO, with clear human-eval gains but limited checks on the data quality.

The main thing to know is that this paper cuts critical errors in VLM-generated image editing instructions from 47.75% down to 23% using a two-stage SFT and DPO approach on Qwen models. They identify three failure modes in current VLMs: orientation inconsistency, viewpoint ambiguity, and insufficient fine-grained attributes. Stage one creates 100K SFT data with GLM annotations, EditScore filtering, and human fixes. Stage two uses 10K targeted human preference pairs for DPO. Their fine-tuned models then score well on Eval-400 and ByteMorph-Bench, nearly matching some proprietary models. This is useful work because it gives a concrete method to improve synthetic data quality, backed by both benchmark scores and human judgments showing correctness up to 66%. The soft spots are the lack of ablations on the filtering or stages, no inter-annotator stats, and no explicit check that the preference pairs avoid introducing new biases while targeting those modes. The stress-test point about possible annotation artifacts is fair to raise, though the error reduction is large enough that the pipeline likely adds value. This paper is for people building or scaling instruction-guided image editing systems who need better training triplets. It is practical rather than theoretical. I would recommend sending it for peer review. The results are clear and the problem is important, even if more experimental details would help.

Referee Report

2 major / 2 minor

Summary. The paper claims that VLMs exhibit three systematic failure modes when synthesizing image editing instructions (orientation inconsistency, viewpoint ambiguity, and insufficient fine-grained attribute description), with over 47% of outputs containing critical errors. It introduces EditCaption, a two-stage pipeline: (1) construction of a 100K SFT dataset via GLM annotation, EditScore filtering, and human refinement; (2) collection of 10K human preference pairs targeting the failure modes followed by DPO. Fine-tuned Qwen3-VL models (particularly the 235B variant) outperform open-source baselines and are competitive with closed models on Eval-400 (4.712), ByteMorph-Bench (4.588), and HQ-Edit, while human evaluation shows critical errors dropping from 47.75% to 23% and correctness rising from 41.75% to 66%.

Significance. If the results hold, the work provides a practical, scalable approach to generating human-aligned training data for instruction-guided image editing, addressing a recognized bottleneck. Concrete benchmark scores and human error-rate reductions are reported, and the two-stage SFT+DPO pipeline is a clear methodological contribution that could be adopted by others working on VLM post-training for vision tasks.

major comments (2)

Stage 2 (human preference data collection for DPO): The manuscript states that the 10K pairs 'target the three failure modes' but reports no inter-annotator agreement, no breakdown of pairs by failure mode, and no analysis or controls for annotator biases, selection artifacts, or inconsistent criteria. Because DPO directly optimizes on these pairs, the absence of such validation is load-bearing for the central claim that the pipeline produces instructions with substantially fewer critical errors (47.75% to 23%).
Evaluation section (benchmark and human results): The 235B model reports 4.712 on Eval-400 versus Gemini-3-Pro at 4.706 and 4.588 on ByteMorph-Bench versus 4.522; these margins are small, yet no standard deviations, number of evaluation runs, statistical significance tests, or exact data splits are provided. This makes it difficult to assess whether the reported gains are robust or could be affected by post-hoc choices.
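
The kind of robustness check this comment asks for can be sketched as a paired bootstrap over per-item benchmark scores. This is a generic illustration, not the authors' evaluation code: the function name is made up here, and the two score lists are toy placeholders rather than values from the paper.

```python
import random
from statistics import mean


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-item difference (a - b).

    Both lists must score the same items in the same order, so that
    resampling items keeps each pair together.
    Returns (observed_mean_diff, ci_lower, ci_upper).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    resampled = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lower = resampled[int((alpha / 2) * n_resamples)]
    upper = resampled[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lower, upper


# Toy per-item scores on a 1-5 scale (placeholders, not paper data).
model_scores = [5, 4, 5, 4, 5]
baseline_scores = [4, 4, 4, 3, 4]
point, lo, hi = paired_bootstrap_ci(model_scores, baseline_scores)
print(f"mean diff = {point:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

If the interval for a gap such as 4.712 versus 4.706 comfortably includes zero, the ranking between the two systems on that benchmark should be treated as a tie.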

minor comments (2)

Abstract and introduction: The human evaluation protocol (number of annotators, exact scoring rubric, image sampling method) is not summarized, which would help readers interpret the 47.75% to 23% error reduction.
Related work: No explicit comparison is drawn to prior DPO applications in vision-language tasks (e.g., for captioning or VQA), which would clarify the novelty of applying it specifically to editing-instruction synthesis.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our work. We address each of the major comments in detail below.

Referee: Stage 2 (human preference data collection for DPO): The manuscript states that the 10K pairs 'target the three failure modes' but reports no inter-annotator agreement, no breakdown of pairs by failure mode, and no analysis or controls for annotator biases, selection artifacts, or inconsistent criteria. Because DPO directly optimizes on these pairs, the absence of such validation is load-bearing for the central claim that the pipeline produces instructions with substantially fewer critical errors (47.75% to 23%).

Authors: We agree that documenting the quality and composition of the preference dataset is crucial for substantiating the DPO improvements. In the revised manuscript, we will provide inter-annotator agreement statistics, a breakdown of the 10K pairs according to the three failure modes, and an analysis of our annotation process including measures to control for biases and ensure consistent criteria. These additions will directly support the validity of the reported error reductions.
Revision: yes

Referee: Evaluation section (benchmark and human results): The 235B model reports 4.712 on Eval-400 versus Gemini-3-Pro at 4.706 and 4.588 on ByteMorph-Bench versus 4.522; these margins are small, yet no standard deviations, number of evaluation runs, statistical significance tests, or exact data splits are provided. This makes it difficult to assess whether the reported gains are robust or could be affected by post-hoc choices.

Authors: We acknowledge the need for more rigorous statistical reporting to contextualize the benchmark results. We will revise the evaluation section to include standard deviations computed over multiple runs, the number of evaluation runs performed, results of statistical significance tests, and precise descriptions of the data splits. Although the numerical margins on the automatic metrics are modest, the human evaluation demonstrates a substantial and consistent improvement in instruction quality, which we believe strengthens the overall conclusions.
Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical pipeline evaluated on independent benchmarks

The paper describes a two-stage empirical pipeline (SFT on 100K dataset followed by DPO on 10K human preference pairs) without any mathematical derivations, equations, or self-definitional reductions. Performance is measured on separate external benchmarks (Eval-400, ByteMorph-Bench, HQ-Edit) and fresh human evaluations that are independent of the training data. No load-bearing self-citations, uniqueness theorems, or ansatzes smuggled via prior work appear in the text. The central claims rest on standard supervised and preference optimization steps whose outputs are validated externally rather than by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that human preference judgments reliably identify and correct the listed failure modes and that the chosen benchmarks are representative proxies for real-world editing instruction quality.

axioms (2)

domain assumption: Human annotators can consistently identify and correct orientation, viewpoint, and attribute-level errors in image editing instructions.
Invoked when constructing the 10K preference pairs and when claiming error-rate reductions.
domain assumption: SFT followed by DPO on VLM outputs will generalize beyond the collected preference data to new image pairs.
Underlying the claim that fine-tuned models outperform baselines on held-out benchmarks.
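
The first axiom is testable: if two annotators label the same rejected instructions with a primary error type, their agreement beyond chance can be summarized with Cohen's kappa. The sketch below is a plain implementation of that statistic; the label strings and the two toy annotation lists are illustrative assumptions, not data from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (p_observed - p_expected) / (1 - p_expected), where p_expected
    is the agreement expected if each annotator labeled independently
    according to their own label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy annotations over four rejected instructions (placeholders).
annotator_1 = ["orientation", "orientation", "viewpoint", "viewpoint"]
annotator_2 = ["orientation", "viewpoint", "orientation", "viewpoint"]
print(cohens_kappa(annotator_1, annotator_2))  # 0.0: agreement is at chance level
```

A kappa near 1 would support the axiom, while a value near 0 would mean the error-type labels used to build the preference pairs are close to arbitrary.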

pith-pipeline@v0.9.0 · 5643 in / 1504 out tokens · 53263 ms · 2026-05-10T17:18:58.364884+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: two-stage post-training pipeline... 100K supervised fine-tuning (SFT) dataset... 10K human preference pairs... Direct Preference Optimization (DPO)

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Paper passage: orientation inconsistency, viewpoint ambiguity, and insufficient fine-grained attribute description

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.