arxiv: 2605.21487 · v1 · pith:YGRVTMJSnew · submitted 2026-05-20 · 💻 cs.CV

Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning

Pith reviewed 2026-05-21 04:30 UTC · model grok-4.3

classification 💻 cs.CV
keywords unified multimodal models · image editing · model tuning · data synthesis · visual question answering · multimodal capabilities · intelligent editing · task conflicts

The pith

Image editing serves as a single general task to enhance understanding, generation, and editing in unified multimodal models

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that current approaches to training unified multimodal models rely on mixing multiple tasks, which creates conflicts and demands complex balancing and multi-stage pipelines. It proposes instead that intelligent image editing is a natural general task because it requires both visual understanding and generation to succeed. To realize this, the authors introduce an automated pipeline that converts VQA data into complex editing instructions containing embedded questions and nested logic, yielding a 148k-example dataset. Experiments on two models show that training on this single dataset and single task produces gains across understanding, generation, and editing without any auxiliary data or operations.

Core claim

Tuning solely on the Uni-Edit task, using a dataset of 148k examples with complex reasoning-intensive editing instructions derived from VQA data via an automated synthesis pipeline, achieves comprehensive enhancements across image understanding, generation, and editing capabilities with only one task, one training stage, and one dataset.

What carries the argument

The automated scalable data synthesis pipeline that transforms diverse VQA data into complex editing instructions with embedded questions and nested logic, producing the Uni-Edit-148k dataset that pairs these instructions with high-quality edited images.
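
As a rough illustration of the transformation described above, the sketch below embeds a VQA question into an editing instruction while withholding the answer, so the model has to work the answer out before it can edit. The paper reports using GPT-4o to write the instructions; the fixed templates, the `VQASample` and `EditSample` types, and the `to_edit_instruction` helper are hypothetical stand-ins, not the authors' code.

```python
from dataclasses import dataclass

# Hypothetical templates, one per edit type. The paper generates instructions
# with GPT-4o; these fixed strings only illustrate the "embedded question" idea.
TEMPLATES = {
    "count": (
        'Examine the original image to answer: "{question}" '
        "Then add a group of {target} whose total count equals that answer."
    ),
    "caption": (
        'Analyze the original image to answer: "{question}" '
        "Then render the answer as handwritten text on a {target}."
    ),
}


@dataclass(frozen=True)
class VQASample:
    image_id: str
    question: str
    answer: str
    edit_type: str  # assumed to come from an upstream classification step


@dataclass(frozen=True)
class EditSample:
    image_id: str
    instruction: str
    hidden_answer: str  # kept aside for target-image generation and checking


def to_edit_instruction(sample: VQASample, target: str) -> EditSample:
    """Embed the question in an edit instruction; the answer is not inserted."""
    template = TEMPLATES[sample.edit_type]
    instruction = template.format(question=sample.question, target=target)
    return EditSample(sample.image_id, instruction, sample.answer)


if __name__ == "__main__":
    vqa = VQASample(
        image_id="img-0",  # made-up identifier
        question="How many rubber balls are the same color as the small metallic block?",
        answer="2",  # made-up answer for illustration
        edit_type="count",
    )
    print(to_edit_instruction(vqa, target="balloons").instruction)
```

Keeping the answer out of the instruction is the design choice that matters here: an instruction that stated the count outright would reduce to an ordinary editing example.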

If this is right

  • Unified multimodal models can reach multi-capability performance through single-task training on an integrative task like intelligent editing.
  • Task conflicts that arise in mixed multi-task training can be avoided by selecting a task that inherently couples understanding and generation.
  • An automated pipeline enables scalable creation of reasoning-heavy editing data without manual curation of instructions.
  • Performance improvements across all three capabilities occur without multi-stage pipelines or auxiliary data balancing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The synthesis method could be extended to create training data for other integrative tasks such as video or 3D editing.
  • If the gains stem from the reasoning structure of the instructions, similar pipelines might improve reasoning in non-editing multimodal benchmarks.
  • Single-task tuning on editing might reduce the overall data volume needed to reach competitive performance in unified models.
  • The approach raises the question of whether other cross-modal tasks could serve as general tuning objectives for additional modality combinations.

Load-bearing premise

The automated synthesis pipeline successfully converts VQA data into complex, reasoning-intensive editing instructions that meaningfully exercise and improve the model's underlying understanding capacity, rather than merely supplying higher-quality editing examples.

What would settle it

Training a model on the Uni-Edit-148k dataset produces no measurable gains on standard understanding or generation benchmarks, or the gains disappear when compared to a control set of simpler editing examples of matched quality.
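
One way to read this test is as a difference of gains: per benchmark, subtract the control model's gain over the base model from the Uni-Edit model's gain. The sketch below assumes scores arrive as plain dictionaries keyed by benchmark name. The function names and the `margin` parameter are illustrative assumptions, and the numbers in the usage example are made up, not results from the paper.

```python
from typing import Dict

Scores = Dict[str, float]


def gain(base: Scores, tuned: Scores) -> Scores:
    """Per-benchmark improvement of a tuned model over the base model."""
    return {name: tuned[name] - base[name] for name in base}


def attributable_gain(base: Scores, uni_edit: Scores, control: Scores) -> Scores:
    """Gain from Uni-Edit tuning beyond what the matched control set yields."""
    full = gain(base, uni_edit)
    ctrl = gain(base, control)
    return {name: full[name] - ctrl[name] for name in base}


def claim_survives(
    base: Scores, uni_edit: Scores, control: Scores, margin: float = 0.0
) -> bool:
    """True only if Uni-Edit beats the control by more than `margin` everywhere."""
    deltas = attributable_gain(base, uni_edit, control)
    return all(delta > margin for delta in deltas.values())


if __name__ == "__main__":
    # Placeholder scores for illustration only.
    base = {"understanding": 50.0, "generation": 60.0}
    uni_edit = {"understanding": 54.0, "generation": 63.0}
    control = {"understanding": 51.0, "generation": 61.0}
    print(attributable_gain(base, uni_edit, control))
    print(claim_survives(base, uni_edit, control))
```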

Figures

Figures reproduced from arXiv: 2605.21487 by Dian Zheng, Hongbo Liu, Hongsheng Li, Hongyu Li, Kaituo Feng, Kai Zou, Manyuan Zhang.

Figure 1: Overview of Uni-Edit. We introduce intelligent image editing as a general tuning task for UMM. By transforming VQA into reasoning-intensive instructions and generating target images via Nano-Pro, we build Uni-Edit-148k. Breaking the trade-offs of existing multi-data mixing strategy, it enhances understanding, generation, and editing using only one task, one dataset, and one training stage. Note our automat…
Figure 2: Data Construction pipeline. We first employ GPT-4o to classify the data from LLaVA-OV1.5 into eight distinct edit types, including attribute, caption, math, grounding, and world knowledge. Next, for each category, we use GPT-4o to embed the original question into an editing instruction and explicitly require the model to perform further editing operations based on the answer to the question. This process al…
Figure 3: Data Distribution of Uni-Edit-148k and Uni-Edit-40k. To this end, we fine-tuned BAGEL on understanding tasks using two state-of-the-art open-source datasets: Bee [29] and LLaVA-OV1.5 [5]. As shown in
Figure 4: Tuning pipeline. In stage 1, we fine-tune BAGEL on our Uni-Edit data using only the generation loss. In stage 2, we align the distribution of the understanding head with the fine-tuned model using 80k understanding samples. MOT Layer means all of the transformer blocks in BAGEL, Both Und., Gen. heads are a single linear layer. ▷ For OCR and caption, we require the model to first generate a caption or perfo…
Figure 5: Comparison of image generation results between Uni-Edit and BAGEL. Tuned on Uni-Edit, the model demonstrates significant improvements in prompt understanding, knowledge reasoning, spatial perception, image composition, and aesthetic quality. boosts the understanding and reasoning ability of the model, resulting in substantial gains on the WISE benchmark. Additionally, since GenEval evaluates spatial reason…
Figure 6: Comparison of image editing results between Uni-Edit and BAGEL. Tuned on Uni-Edit, the model shows significant improvements in instruction following, logic, and spatial reasoning
Figure 7: Example of Shape Type in Uni-Edit-148k. Origin Question: How many rubber balls are the same color as the small metallic block? Edit Instruction: Identify Examine the original image to determine the count of rubber balls that are the same color as the small metallic block, based on the question provided. Then, synthesize a visually distinct group of balloons, ensuring the total count matches the number of the…
Figure 8: Example of Count Type in Uni-Edit-148k. Origin Question: What is the main subject of this image? Edit Instruction: Analyze the original image to identify the main subject based on the given question. Create a new image displaying a close-up of a text medium, such as a chalkboard or parchment. Write a descriptive caption that specifically highlights the main subject of the original image using a 'Handwritten'…
Figure 9: Example of Caption Type in Uni-Edit-148k.
Figure 10
Figure 11: Example of Color Type in Uni-Edit-148k.
Figure 12: Example of OCR Type in Uni-Edit-148k. Origin Question: Given the area of the parallelogram ABCD is 102 and the lengths of sides AB and AD are 23 and 14 respectively, calculate the degree of the BAD angle. Round computations to 2 decimal places. Edit Instruction: Calculate the degree of the BAD angle in a parallelogram where the area is 102, and the lengths of sides AB and AD are 23 and 14 respectively (Roun…
Figure 13: Example of Math Type in Uni-Edit-148k.

Original abstract

Currently, enhancing Unified Multimodal Models (UMMs) with image understanding, generation, and editing capabilities mainly relies on mixed multi-task training. Due to inherent task conflicts, such strategy requires complex multi-stage pipelines, massive data mixing, and balancing tricks, merely resulting in a performance trade-off rather than true mutual reinforcement. To break this paradigm, we propose Uni-Edit, an intelligent image editing task that serves as the first general task for UMM tuning. Unlike complex mixed pipelines, Uni-Edit improves performance across all three abilities at once using only one task, one training stage, and one dataset. Specifically, we first identify image editing as an inherently ideal general task, as it naturally demands both visual understanding and generation. However, existing editing data relies on simplistic instructions that severely underutilize a model's understanding capacity. To address this, we introduce the first automated and scalable data synthesis pipeline for intelligent editing, transforming diverse VQA data into complex and effective editing instructions with embedded questions and nested logic. This yields Uni-Edit-148k, pairing diverse reasoning-intensive instructions with high-quality edited images. Extensive experiments on BAGEL and Janus-Pro demonstrate that tuning solely on Uni-Edit achieves comprehensive enhancements across all three capabilities without any auxiliary operations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that Uni-Edit, an intelligent image editing task, serves as the first general task for tuning Unified Multimodal Models (UMMs). Unlike mixed multi-task training that requires complex pipelines and leads to performance trade-offs, single-task tuning on Uni-Edit simultaneously improves image understanding, generation, and editing using one task, one stage, and one dataset. The authors introduce an automated scalable synthesis pipeline that converts VQA data into complex editing instructions with embedded questions and nested logic, producing the Uni-Edit-148k dataset of reasoning-intensive instructions paired with edited images. Experiments on BAGEL and Janus-Pro show comprehensive enhancements across all three capabilities without auxiliary operations.

Significance. If the results hold after addressing controls for data quality, this would represent a meaningful simplification for UMM training by showing that a single well-designed task can achieve mutual reinforcement across capabilities. Credit is due for the automated synthesis pipeline and the empirical demonstration on two models. The work could influence future tuning strategies if the reasoning-intensive structure is shown to be load-bearing rather than incidental to data curation.

major comments (2)
  1. [§4 Experiments] The abstract and results claim comprehensive enhancements on BAGEL and Janus-Pro after single-task tuning on Uni-Edit, yet no baseline comparisons, exact metrics per capability, statistical significance, or controls for data volume/quality are described. This is load-bearing for the central claim that Uni-Edit outperforms mixed training without auxiliary operations.
  2. [§3.2 Data Synthesis Pipeline] The pipeline is presented as producing 'complex and effective editing instructions with embedded questions and nested logic' that exercise understanding capacity, but no ablation compares performance against a control set of equivalent size and quality using only simplistic instructions. Without this isolation, gains cannot be attributed to the reasoning-intensive structure rather than data curation effects.
minor comments (2)
  1. [Abstract] Specify the quantitative scale of improvements (e.g., percentage gains on key metrics) to strengthen the summary of results.
  2. [Figure 1] Figure 1 or pipeline diagram: add explicit labels for each transformation step from VQA to editing instruction to improve clarity of the synthesis process.
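
The statistical-significance check requested in major comment 1 could take many forms; one common choice is a paired bootstrap over per-example correctness on a shared benchmark. The sketch below is an assumption about how such a check might look, not a procedure described in the paper. The `paired_bootstrap` function and its parameters are hypothetical.

```python
import random
from typing import Sequence


def paired_bootstrap(
    base_correct: Sequence[int],
    tuned_correct: Sequence[int],
    n_resamples: int = 2000,
    seed: int = 0,
) -> float:
    """Fraction of resamples in which the tuned model is NOT more accurate.

    Both inputs hold 0/1 correctness flags for the same examples in the same
    order. A small return value suggests the tuned model's advantage is
    unlikely to be an artifact of which examples were sampled.
    """
    if len(base_correct) != len(tuned_correct):
        raise ValueError("inputs must cover the same examples")
    if not base_correct:
        raise ValueError("inputs must be non-empty")

    rng = random.Random(seed)
    n = len(base_correct)
    # Per-example difference: +1 tuned-only correct, -1 base-only correct.
    diffs = [t - b for b, t in zip(base_correct, tuned_correct)]

    not_better = 0
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping pairs together.
        total = sum(diffs[rng.randrange(n)] for _ in range(n))
        if total <= 0:
            not_better += 1
    return not_better / n_resamples
```

Resampling the paired differences, rather than each model's scores separately, keeps the comparison tied to the same examples, which matters when both models find the same items hard.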

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the experimental validation needed to support our central claims. We address each major point below and commit to revisions that strengthen the evidence for Uni-Edit as a general task.

  1. Referee: [§4 Experiments] The abstract and results claim comprehensive enhancements on BAGEL and Janus-Pro after single-task tuning on Uni-Edit, yet no baseline comparisons, exact metrics per capability, statistical significance, or controls for data volume/quality are described. This is load-bearing for the central claim that Uni-Edit outperforms mixed training without auxiliary operations.

    Authors: We acknowledge that the manuscript would benefit from more granular reporting to fully substantiate the performance gains. The presented results on BAGEL and Janus-Pro demonstrate improvements across capabilities after Uni-Edit tuning, but we agree that explicit baselines against mixed multi-task training, per-capability numerical metrics (e.g., VQA accuracy for understanding, FID/CLIP scores for generation, and instruction adherence for editing), statistical significance testing, and data-volume/quality controls are necessary. In the revised version, we will expand §4 to include these elements, using matched data volumes from existing sources as controls to isolate the effect of the single-task approach.

    Revision: yes

  2. Referee: [§3.2 Data Synthesis Pipeline] The pipeline is presented as producing 'complex and effective editing instructions with embedded questions and nested logic' that exercise understanding capacity, but no ablation compares performance against a control set of equivalent size and quality using only simplistic instructions. Without this isolation, gains cannot be attributed to the reasoning-intensive structure rather than data curation effects.

    Authors: We agree that directly isolating the contribution of the reasoning-intensive structure (embedded questions and nested logic) versus general data curation effects would strengthen attribution. The current experiments show overall gains from the full Uni-Edit-148k dataset, but lack this specific control. We will add the requested ablation in the revision by constructing a control dataset of equivalent size and quality using only simplistic instructions from the same VQA sources, then compare tuning results on BAGEL and Janus-Pro to demonstrate whether the complex structure is load-bearing.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical tuning and data synthesis with independent experimental validation

The paper describes an empirical pipeline for synthesizing editing instructions from VQA data and tuning UMMs on the resulting Uni-Edit-148k dataset. No mathematical derivations, equations, or self-referential definitions appear in the provided text. Claims rest on experimental outcomes across models (BAGEL, Janus-Pro) rather than any fitted parameter renamed as a prediction or uniqueness theorem imported from prior self-citation. The central result—that single-task tuning improves understanding, generation, and editing—is presented as an observed outcome of the synthesis and training process, not reduced by construction to its inputs. This is a standard empirical contribution with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical study of model tuning and data synthesis; it introduces no mathematical axioms, free parameters in derivations, or new postulated physical entities.

pith-pipeline@v0.9.0 · 5767 in / 1308 out tokens · 44904 ms · 2026-05-21T04:30:53.596133+00:00 · methodology

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 20 internal anchors

  1. [1]

    Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

    Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940, 2024

  2. [2]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025

  3. [3]

    Emu3.5: Native multimodal models are world learners

    Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, et al. Emu3.5: Native multimodal models are world learners. arXiv preprint arXiv:2510.26583, 2025

  4. [4]

    Anyedit: Mastering unified high-quality image editing for any idea

    Qifan Yu, Wei Chow, Zhongqi Yue, Kaihang Pan, Yang Wu, Xiaoyang Wan, Juncheng Li, Siliang Tang, Hanwang Zhang, and Yueting Zhuang. Anyedit: Mastering unified high-quality image editing for any idea. In CVPR, 2025

  5. [5]

    LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

    Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Didi Zhu, et al. Llava-onevision-1.5: Fully open framework for democratized multimodal training. arXiv preprint arXiv:2509.23661, 2025

  6. [6]

    Nano-banana-pro

    Google. Nano-banana-pro. Accessed November, 2025 [Online] https://deepmind.google/models/gemini-image/pro/, 2025

  7. [7]

    GPT-4o

    OpenAI. Gpt-4o. Accessed November 18, 2024 [Online] https://chatgpt.com/, 2024

  8. [8]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report. ar...

  9. [9]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  10. [10]

    The llama 3 herd of models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv e-prints, 2024

  11. [11]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024

  12. [12]

    Emu3: Next-Token Prediction is All You Need

    Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024

  13. [13]

    Chameleon: Mixed-Modal Early-Fusion Foundation Models

    Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024

  14. [14]

    VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

    Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, et al. Vila-u: a unified foundation model integrating visual understanding and generation. arXiv preprint arXiv:2409.04429, 2024

  15. [15]

    Longcat-next: Lexicalizing modalities as discrete tokens

    Meituan LongCat Team, Bin Xiao, Chao Wang, Chengjiang Li, Chi Zhang, Chong Peng, Hang Yu, Hao Yang, Haonan Yan, Haoze Sun, et al. Longcat-next: Lexicalizing modalities as discrete tokens. arXiv preprint arXiv:2603.27538, 2026

  16. [16]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025

  17. [17]

    HunyuanImage 3.0 Technical Report

    Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, et al. Hunyuanimage 3.0 technical report. arXiv preprint arXiv:2509.23951, 2025

  18. [18]

    Onecat: Decoder-only auto-regressive model for unified understanding and generation

    Han Li, Xinyu Peng, Yaoming Wang, Zelin Peng, Xin Chen, Rongxiang Weng, Jingang Wang, Xunliang Cai, Wenrui Dai, and Hongkai Xiong. Onecat: Decoder-only auto-regressive model for unified understanding and generation. arXiv preprint arXiv:2509.03498, 2025

  19. [19]

    AIA: Rethinking Architecture Decoupling Strategy In Unified Multimodal Model

    Dian Zheng, Manyuan Zhang, Hongyu Li, Kai Zou, Hongbo Liu, Ziyu Guo, Kaituo Feng, Yexin Liu, Ying Luo, Yan Feng, et al. Architecture decoupling is not all you need for unified multimodal model. arXiv preprint arXiv:2511.22663, 2025

  20. [20]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024

  21. [21]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023

  22. [22]

    Flux

    Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024

  23. [23]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report. arXiv preprint arXiv:2508.02324, 2025

  24. [24]

    Glm-image

    Zhipu AI. Glm-image. https://huggingface.co/zai-org/GLM-Image, 2026

  25. [25]

    Step1X-Edit: A Practical Framework for General Image Editing

    Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761, 2025

  26. [26]

    Nextstep-1: Toward autoregressive image generation with continuous tokens at scale

    NextStep Team, Chunrui Han, Guopeng Li, Jingwei Wu, Quan Sun, Yan Cai, Yuang Peng, Zheng Ge, Deyu Zhou, Haomiao Tang, et al. Nextstep-1: Toward autoregressive image generation with continuous tokens at scale. arXiv preprint arXiv:2508.10711, 2025

  27. [27]

    Vq-va world: Towards high-quality visual question-visual answering

    Chenhui Gou, Zilong Chen, Zeyu Wang, Feng Li, Deyao Zhu, Zicheng Duan, Kunchang Li, Chaorui Deng, Hongyi Yuan, Haoqi Fan, et al. Vq-va world: Towards high-quality visual question-visual answering. arXiv preprint arXiv:2511.20573, 2025

  28. [28]

    Factuality matters: When image generation and editing meet structured visuals

    Le Zhuo, Songhao Han, Yuandong Pu, Boxiang Qiu, Sayak Paul, Yue Liao, Yihao Liu, Jie Shao, Xi Chen, Si Liu, and Hongsheng Li. Factuality matters: When image generation and editing meet structured visuals. arXiv preprint arXiv:2510.05091, 2025

  29. [29]

    Bee: A high-quality corpus and full-stack suite to unlock advanced fully open mllms

    Yi Zhang, Bolin Ni, Xin-Sheng Chen, Heng-Rui Zhang, Yongming Rao, Houwen Peng, Qinglin Lu, Han Hu, Meng-Hao Guo, and Shi-Min Hu. Bee: A high-quality corpus and full-stack suite to unlock advanced fully open mllms. arXiv preprint arXiv:2510.13795, 2025

  30. [30]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023

  31. [31]

    Mmbench: Is your multi-modal model an all-around player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In ECCV, 2024

  32. [32]

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In CVPR, 2024

  33. [33]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023

  34. [34]

    Eyes wide shut? exploring the visual shortcomings of multimodal llms

    Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. In CVPR, 2024

  35. [35]

    Geneval: An object-focused framework for evaluating text-to-image alignment

    Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. In NeurIPS, 2023

  36. [36]

    WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

    Yuwei Niu, Munan Ning, Mengren Zheng, Weiyang Jin, Bin Lin, Peng Jin, Jiaqi Liao, Chaoran Feng, Kunpeng Ning, Bin Zhu, et al. Wise: A world knowledge-informed semantic evaluation for text-to-image generation. arXiv preprint arXiv:2503.07265, 2025

  37. [37]

    ImgEdit: A Unified Image Editing Dataset and Benchmark

    Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, and Li Yuan. Imgedit: A unified image editing dataset and benchmark. arXiv preprint arXiv:2505.20275, 2025

  38. [38]

    Envisioning beyond the pixels: Benchmarking reasoning-informed visual editing

    Xiangyu Zhao, Peiyuan Zhang, Kexian Tang, Xiaorong Zhu, Hao Li, Wenhao Chai, Zicheng Zhang, Renqiu Xia, Guangtao Zhai, Junchi Yan, et al. Envisioning beyond the pixels: Benchmarking reasoning-informed visual editing. arXiv preprint arXiv:2504.02826, 2025

  39. [39]

    Reconstruction alignment improves unified multimodal models

    Ji Xie, Trevor Darrell, Luke Zettlemoyer, and XuDong Wang. Reconstruction alignment improves unified multimodal models. arXiv preprint arXiv:2509.07295, 2025.

    A System Prompt Task Type Classification System Prompt "You are an expert data processor. Your task is to analyze the inpu...

  40. [40]

    **Blurriness & Artifacts**: - Is the image significantly blurry, pixelated, or noisy? - Are there compression artifacts or "fried" textures? - Is the text (if any) legible, or is it garbled/gibberish?

  41. [41]

    Uncanny Valley

    **Structural Coherence (The "Uncanny Valley" Check)**: - Do objects look physically plausible? - Are there distorted limbs, melted faces, or floating objects that defy gravity? - Is the composition chaotic or nonsensical?

  42. [42]

    pasted-on

    **Visual Harmony**: - Do the lighting and shadows match across the image? - Are there harsh, unnatural seams or "pasted-on" effects (bad compositing)? - Are the colors overly saturated, washed out, or broken? ### Scoring Scale (1-5): - **5 (High Quality)**: Sharp, coherent, natural-looking, and aesthetically pleasing. No visible artifacts. - **4 (Good)**:...

  43. [43]

    **Original Image**: The first input image, which is before editing and is a realistic image

  44. [44]

    **Edited Image**: The second input image, which is the one after editing

  45. [45]

    the region in the answer

    **edit_instruction**: The command the model was supposed to follow. Note that this instruction may involve: - **Spatial Grounding**: Referring to specific regions (e.g., "the region in the answer"). - **Visual Transformation**: Changing style, objects, attributes or doing ocr, caption

  46. [46]

    replace bushes with flower beds

    **original_question & process_answer**: These define the **target** or **premise** of the edit. - If the Answer is a coordinate (bounding box), it defines *where* the edit must happen. - If the Answer is a caption/description, it defines the *answer* for the region and it need to be pushed into a blackboard or letter based on the edit_instruction. ### Eva...