UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs
Pith reviewed 2026-05-10 08:59 UTC · model grok-4.3
The pith
A single benchmark with distilled evaluators allows fair, low-cost comparison of image and video editing methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
UniEditBench supplies a shared protocol covering nine image operations (add, remove, replace, change, stroke-based, extract, adjust, count, and reorder) and eight video operations, including compositional challenges. A large MLLM teacher is distilled into 4B and 8B evaluators that output scores for structural fidelity, text alignment, background consistency, naturalness, and temporal-spatial consistency, and these lightweight judges retain strong agreement with human ratings while cutting deployment cost substantially.
What carries the argument
The distilled 4B/8B MLLM evaluators that deliver multi-dimensional scores aligned with human judgments.
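How such a judge is queried in practice is not spelled out in this summary. The sketch below assumes the distilled evaluator is served behind an OpenAI-compatible endpoint (as vLLM commonly provides for Qwen-VL models) and returns an explanation followed by a JSON object of scores; the endpoint, the model id `unieditbench-judge-8b`, the prompt wording, and the output format are all assumptions, not the released interface.
```python
# Hypothetical sketch: querying a lightweight distilled judge served behind an
# OpenAI-compatible endpoint (e.g., via vLLM) for multi-dimensional edit scores.
# Model name, endpoint, prompt wording, and output format are assumptions.
import base64
import json
import re

import requests

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # assumed local vLLM server
MODEL = "unieditbench-judge-8b"                          # hypothetical model id

DIMENSIONS = [
    "structural_fidelity",
    "text_alignment",
    "background_consistency",
    "naturalness",
]

def to_data_url(path: str) -> str:
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

def score_edit(source_img: str, edited_img: str, instruction: str) -> dict:
    """Ask the judge for an explanation followed by JSON scores (1-10 per dimension)."""
    prompt = (
        "You are an image-editing judge. First explain your reasoning per dimension, "
        f"then output a JSON object with integer scores 1-10 for: {', '.join(DIMENSIONS)}.\n"
        f"Editing instruction: {instruction}"
    )
    payload = {
        "model": MODEL,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": to_data_url(source_img)}},
                {"type": "image_url", "image_url": {"url": to_data_url(edited_img)}},
            ],
        }],
        "temperature": 0.0,
    }
    reply = requests.post(ENDPOINT, json=payload, timeout=120).json()
    text = reply["choices"][0]["message"]["content"]
    # Take the last JSON object in the reply (the scores follow the explanation).
    match = re.findall(r"\{[^{}]*\}", text)[-1]
    return json.loads(match)
```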
If this is right
- Different editing approaches can be ranked directly against each other using identical rules and metrics.
- Video editing studies gain a consistent way to report results that was previously missing.
- Evaluation runs become cheap enough to repeat across many models and settings.
- Challenging tasks such as spatial reordering receive explicit coverage instead of being ignored by older metrics.
Where Pith is reading between the lines
- Papers on new editing techniques could adopt this protocol so that results become easier to compare across publications.
- The same distillation method might produce affordable judges for other vision-language tasks that currently rely on expensive models.
- Developers could test whether the evaluators remain reliable when applied to editing outputs from models released after the distillation training.
Load-bearing premise
Distilling the large teacher model into the smaller evaluators preserves accurate judgment on every operation, including the hardest compositional ones, without adding new biases.
What would settle it
Collect human ratings on a fresh set of image and video edits that includes many count and reorder cases, then measure whether the 4B or 8B evaluators show clearly lower correlation with those ratings than the original large model does.
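A minimal sketch of that check, assuming per-sample scores are gathered into a table with columns `human`, `teacher`, `judge_4b`, `judge_8b`, and `operation` (hypothetical names); it reports overall and per-operation correlations so count and reorder cases can be inspected separately.
```python
# Minimal sketch of the proposed check: compare how well the distilled judges
# and the teacher track fresh human ratings, overall and per operation.
# File and column names (human, teacher, judge_4b, judge_8b, operation) are assumptions.
import pandas as pd
from scipy.stats import pearsonr, spearmanr

df = pd.read_csv("fresh_human_eval.csv")  # hypothetical: one row per edited sample

def agreement(pred_col: str, group: pd.DataFrame) -> dict:
    r, _ = pearsonr(group[pred_col], group["human"])
    rho, _ = spearmanr(group[pred_col], group["human"])
    return {"pearson": round(r, 3), "spearman": round(rho, 3)}

for model in ["teacher", "judge_4b", "judge_8b"]:
    print(f"{model}: overall {agreement(model, df)}")
    # Per-operation breakdown, with special attention to Count and Reorder.
    for op, group in df.groupby("operation"):
        print(f"  {op}: {agreement(model, group)}")
```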
Original abstract
The evaluation of visual editing models remains fragmented across methods and modalities. Existing benchmarks are often tailored to specific paradigms, making fair cross-paradigm comparisons difficult, while video editing lacks reliable evaluation benchmarks. Furthermore, common automatic metrics often misalign with human preference, yet directly deploying large multimodal models (MLLMs) as evaluators incurs prohibitive computational and financial costs. We present UniEditBench, a unified benchmark for image and video editing that supports reconstruction-based and instruction-driven methods under a shared protocol. UniEditBench includes a structured taxonomy of nine image operations (Add, Remove, Replace, Change, Stroke-based, Extract, Adjust, Count, Reorder) and eight video operations, with coverage of challenging compositional tasks such as counting and spatial reordering. To enable scalable evaluation, we distill a high-capacity MLLM judge (Qwen3-VL-235B-A22B Instruct) into lightweight 4B/8B evaluators that provide multi-dimensional scoring over structural fidelity, text alignment, background consistency, naturalness, and temporal-spatial consistency (for videos). Experiments show that the distilled evaluators maintain strong agreement with human judgments and substantially reduce deployment cost relative to the teacher model. UniEditBench provides a practical and reproducible protocol for benchmarking modern visual editing methods. Our benchmark and the associated reward models are publicly available at https://github.com/wesar1/UniEditBench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces UniEditBench, a unified benchmark for image and video editing that supports both reconstruction-based and instruction-driven methods under a shared protocol. It defines a taxonomy of nine image operations (Add, Remove, Replace, Change, Stroke-based, Extract, Adjust, Count, Reorder) and eight video operations, including compositional tasks. The authors distill a large MLLM (Qwen3-VL-235B-A22B Instruct) into lightweight 4B/8B evaluators that provide multi-dimensional scores on structural fidelity, text alignment, background consistency, naturalness, and temporal-spatial consistency. Experiments are claimed to show strong human agreement and substantial cost reductions relative to the teacher model, with the benchmark and reward models released publicly.
Significance. If the experimental claims on agreement and cost savings are substantiated with rigorous metrics, this benchmark could standardize evaluation across fragmented visual editing paradigms and modalities, addressing misalignment between automatic metrics and human preferences while lowering barriers to scalable assessment. The public release of the benchmark and models is a clear strength for reproducibility.
major comments (2)
- [Abstract and Experiments] The central claim that the distilled 4B/8B evaluators 'maintain strong agreement with human judgments' is given without quantitative support: no agreement metrics (e.g., Pearson correlation, Cohen's kappa, or Spearman's rho), number of human raters, test-set sizes or splits, statistical significance testing, or controls for data leakage during distillation from the teacher model. This is load-bearing for the reliability claim across all nine image and eight video operations.
- [Distillation and Evaluation sections] No equations, loss formulations, or ablation results are provided to show how multi-dimensional scoring is preserved in the smaller models, leaving open the risk of degradation on challenging compositional tasks such as Count and Reorder. This bears directly on the weakest assumption, that agreement holds without hidden biases.
minor comments (2)
- [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., agreement score or cost reduction factor) rather than qualitative statements.
- [Figures] Ensure all figure captions explicitly label the evaluation dimensions and clarify whether results are averaged across operations or reported per-operation.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment point by point below and have revised the manuscript to strengthen the presentation of our results.
Point-by-point responses
Referee: [Abstract and Experiments] The central claim that the distilled 4B/8B evaluators 'maintain strong agreement with human judgments' is given without quantitative support: no agreement metrics (e.g., Pearson correlation, Cohen's kappa, or Spearman's rho), number of human raters, test-set sizes or splits, statistical significance testing, or controls for data leakage during distillation from the teacher model. This is load-bearing for the reliability claim across all nine image and eight video operations.
Authors: We agree that the Experiments section would benefit from more explicit quantitative support for the agreement claims. In the revised manuscript we expand this section with a dedicated table and accompanying text reporting Pearson correlation, Cohen's kappa, and Spearman's rho per dimension and operation, the number of human raters and their inter-rater agreement, test-set sizes and splits, statistical significance tests, and the train/test partitioning used during distillation to mitigate data leakage. These additions directly substantiate the reliability claims for all nine image and eight video operations. Revision: yes.
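As an illustration of the statistics promised here, a sketch of quadratically weighted Cohen's kappa (suitable for discrete ordinal scores) and a paired bootstrap on the teacher-student correlation gap; the function and variable names are hypothetical, and scores are assumed to be aligned per sample.
```python
# Sketch of agreement statistics: quadratically weighted Cohen's kappa for
# ordinal scores, plus a paired bootstrap comparing student vs. teacher
# correlation with humans. Inputs are assumed to be per-sample aligned arrays.
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

def weighted_kappa(judge_scores, human_scores):
    # Requires discrete (e.g., integer 1-10) scores on a shared scale.
    return cohen_kappa_score(judge_scores, human_scores, weights="quadratic")

def bootstrap_corr_gap(student, teacher, human, n_boot=10_000, seed=0):
    """Bootstrap distribution of (teacher rho - student rho) against human ratings."""
    rng = np.random.default_rng(seed)
    student, teacher, human = map(np.asarray, (student, teacher, human))
    n = len(human)
    gaps = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)          # resample samples with replacement
        t_rho, _ = spearmanr(teacher[idx], human[idx])
        s_rho, _ = spearmanr(student[idx], human[idx])
        gaps[b] = t_rho - s_rho
    low, high = np.percentile(gaps, [2.5, 97.5])
    return gaps.mean(), (low, high)          # mean gap and its 95% CI
```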
Referee: [Distillation and Evaluation sections] No equations, loss formulations, or ablation results are provided to show how multi-dimensional scoring is preserved in the smaller models, leaving open the risk of degradation on challenging compositional tasks such as Count and Reorder. This bears directly on the weakest assumption, that agreement holds without hidden biases.
Authors: We concur that the distillation description is currently insufficient. We have revised the Distillation and Evaluation sections to include the loss equations and formulations used to transfer multi-dimensional scoring, together with ablation results comparing the 4B/8B models against the teacher on compositional operations including Count and Reorder. These additions clarify how scoring fidelity is preserved and address potential degradation and bias concerns. Revision: yes.
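The paper's actual loss is not reproduced in this review; a common recipe for distilling an MLLM judge combines token-level supervised fine-tuning on the teacher's explanation-plus-scores responses with a regression term on the numeric scores. The PyTorch sketch below shows that recipe as an assumption, not the authors' formulation.
```python
# Hypothetical sketch of a judge-distillation loss: token-level cross-entropy on
# the teacher's explanation + score text, plus an MSE term tying the student's
# parsed numeric scores to the teacher's scores. One common recipe, not the
# paper's stated formulation.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits,      # (batch, seq_len, vocab)
                      teacher_token_ids,   # (batch, seq_len) teacher response tokens
                      student_scores,      # (batch, n_dims) scores decoded from the student
                      teacher_scores,      # (batch, n_dims) scores from the teacher judge
                      pad_id: int,
                      alpha: float = 1.0,
                      beta: float = 0.1):
    # Standard next-token SFT loss on the teacher's full response (shifted by one).
    ce = F.cross_entropy(
        student_logits[:, :-1].reshape(-1, student_logits.size(-1)),
        teacher_token_ids[:, 1:].reshape(-1),
        ignore_index=pad_id,
    )
    # Regression term keeping the multi-dimensional scores close to the teacher's.
    mse = F.mse_loss(student_scores, teacher_scores)
    return alpha * ce + beta * mse
```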
Circularity Check
No significant circularity in benchmark construction or evaluator validation
Full rationale
The paper independently defines a new taxonomy of nine image and eight video operations, constructs UniEditBench under a shared protocol for reconstruction-based and instruction-driven methods, and distills lightweight evaluators from an external teacher model (Qwen3-VL-235B-A22B Instruct). Reported agreement with human judgments is obtained via a separate empirical evaluation on held-out data rather than by construction from fitted parameters or self-referential equations. No load-bearing self-citations, ansatzes smuggled in via prior work, or reductions of predictions to inputs are present; the validation chain rests on external benchmarks and human annotations.
Evaluator scoring dimensions (from the judge prompts)
Image editing:
- **Structural Fidelity**: Evaluate whether the structure, pose, orientation, and spatial relationships of unedited entities remain perfectly consistent with the original image.
- **Text-Image Alignment**: Evaluate how accurately the edited content reflects the requested changes described in the edited prompt.
- **Background Consistency**: Evaluate the consistency of all regions EXCEPT the edited subject. Check for unwanted changes, color shifts, or distortions in unedited areas.
- **Naturalness**: Evaluate whether the overall scene appears natural. Look for noticeable flaws such as inconsistent lighting, perspective errors, structural distortions, or watermarks.
Video editing:
- **Structural Fidelity**: Evaluate whether the structure, pose, orientation, and spatial relationships of unedited entities remain consistent with the original video.
- **Text-Video Alignment**: Evaluate how accurately the edited content reflects the requested changes described in the edited prompt.
- **Background Consistency**: Evaluate the consistency of all regions except the edited subject.
- **Naturalness**: Evaluate whether the overall video appears natural, checking for flaws like inconsistent lighting, perspective errors, or structural distortions.
- **Temporal-Spatial Consistency**: Focus on the continuity and logical rationality of the video across time and space. Check if the object maintains identity consistency, if motion trajectories are smooth, and if spatial logic (e.g., gravity, depth) is reasonable. Penalize any flickering, teleportation, or physical violations.
In both prompts, the judge must first provide a detailed step-by-step explanation for each dimension and then output the final numerical scores.
Table 5: Detailed MSE on Image Editing categorized by editing operations.
| Model | Add | Adjust | Change | Count | Extract | Remove | Reorder | Replace | Stroke | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Zero-shot 4B | 1.94 | 1.91 | 3.43 | 2.85 | 5.92 | 3.38 | 4.00 | 2.11 | 1.47 | 2.95 |
| Zero-shot 8B | 1.34 | 1.46 | 2.46 | 2.34 | 5.24 | 3.00 | 3.95 | 1.87 | 2.11 | 2.51 |
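Given the explanation-then-scores output format above and the MSE-versus-teacher metric reported in Table 5, a small sketch of score parsing and per-operation MSE follows; the regular expression, score scale, and record layout are assumptions.
```python
# Sketch: parse "explanation first, scores last" judge output and compute the
# per-operation MSE against teacher scores, mirroring the layout of Table 5.
# The exact output format and data layout are assumptions.
import re
from collections import defaultdict

import numpy as np

SCORE_LINE = re.compile(r"(structural fidelity|text alignment|background consistency|"
                        r"naturalness|temporal-spatial consistency)\s*[:=]\s*(\d+)",
                        re.IGNORECASE)

def parse_scores(judge_output: str) -> dict:
    """Pull trailing numeric scores out of an explanation-then-scores reply.

    Later mentions overwrite earlier ones, so the final score block wins.
    """
    return {dim.lower(): int(val) for dim, val in SCORE_LINE.findall(judge_output)}

def per_operation_mse(records):
    """records: iterable of (operation, student_score_vector, teacher_score_vector)."""
    buckets = defaultdict(list)
    for op, student, teacher in records:
        buckets[op].append(np.mean((np.asarray(student) - np.asarray(teacher)) ** 2))
    return {op: float(np.mean(v)) for op, v in buckets.items()}
```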