PaintBench: Deterministic Evaluation of Precise Visual Editing
Pith reviewed 2026-06-28 19:18 UTC · model grok-4.3
The pith
Across 11 image editing models, the best scores only 17.1 percent mIoU on precise visual editing tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PaintBench targets 20 fundamental precise visual editing operations across geometric transformation, structural manipulation, color change, and symbolic reasoning. Its procedural generation with configurable complexity produces an effectively infinite test suite, and deterministic pixel-level mIoU evaluation removes dependence on judge models. Across 11 image editing models the highest score is 17.1 percent mIoU, with geometric transformations, most structural manipulations, and formula-based color changes proving especially difficult; fine-grained diagnostics link performance drops to object count, background complexity, color scheme, and edit-region size. Scores on PaintBench also show strong linear correlation with scores on TinyGrafixBench, a procedural data-visualization editing benchmark (R² = 0.91, p < 0.001).
What carries the argument
PaintBench benchmark of 20 procedurally generated precise visual editing operations evaluated by deterministic pixel-level mean intersection over union.
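The source does not spell out how mIoU is computed on edited images. As one plausible reading, a minimal Python sketch could compare the region a model actually changed against the region the ground truth changes, then average the per-sample IoU. The function names, the `tol` parameter, and the changed-pixel mask definition are illustrative assumptions, not PaintBench's implementation.

```python
import numpy as np

def changed_mask(before, after, tol=0):
    """Boolean H x W mask of pixels that differ between two H x W x C images.

    `tol` is a hypothetical per-channel tolerance; 0 means exact comparison.
    """
    diff = np.abs(before.astype(np.int32) - after.astype(np.int32))
    return diff.max(axis=-1) > tol

def iou(pred_mask, true_mask):
    """Intersection over union of two boolean masks (1.0 if both are empty)."""
    union = np.logical_or(pred_mask, true_mask).sum()
    if union == 0:
        return 1.0
    inter = np.logical_and(pred_mask, true_mask).sum()
    return float(inter) / float(union)

def mean_iou(sources, outputs, targets, tol=0):
    """Average IoU between model-edited and ground-truth-edited regions."""
    scores = [
        iou(changed_mask(src, out, tol), changed_mask(src, tgt, tol))
        for src, out, tgt in zip(sources, outputs, targets)
    ]
    return float(np.mean(scores))
```

Because every step is plain array arithmetic, the same inputs always yield the same score, which is the property that lets a metric of this kind replace a judge model. Note that this sketch only checks where pixels changed, not whether the new values are correct; a stricter variant would also compare the output against the target inside the edited region.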
If this is right
- Targeted gains on geometric transformations and structural manipulations would be required to raise overall model performance on precise edits.
- Robustness to changes in object count, background complexity, and edit-region size would need to improve for consistent results across scenes.
- The observed linear correlation with applied editing performance implies PaintBench scores can forecast success on related tasks without separate testing.
- Replacement of judge-model evaluation with pixel mIoU removes one source of bias in measuring editing accuracy.
Where Pith is reading between the lines
- Model developers could incorporate PaintBench scores to prioritize training or fine-tuning on the hardest operation categories.
- The benchmark's modular design allows straightforward addition of new operation types or complexity levels to cover emerging editing needs.
- Model-specific performance patterns suggest that combining outputs from multiple specialized models could raise aggregate accuracy on mixed tasks.
- Widespread adoption might shift evaluation standards in the field toward deterministic, pixel-exact metrics for editing rather than open-ended generation.
Load-bearing premise
The 20 selected operations and their procedural generation rules form a representative proxy for all precise visual editing needs.
What would settle it
A model achieving high PaintBench scores yet low accuracy on a broad collection of real user-specified precise edits outside the procedural generation rules would disprove its value as a general proxy.
Original abstract
While current multimodal models are proficient at open-ended visual editing, executing precise single-answer edits remains an important obstacle. To probe this challenge, we introduce PaintBench, a dynamically scalable benchmark targeting 20 fundamental precise visual editing operations across four categories: geometric transformation, structural manipulation, color change, and symbolic reasoning. Procedural generation with configurable complexity enables an effectively infinite, contamination-resistant evaluation suite, and deterministic pixel-level evaluation eliminates reliance on bias-prone judge models. Across 11 image editing models, we find overall low performance, with the current highest-performing industry leader scoring only 17.1% (mIoU). Task decomposition reveals especially challenging operation types (geometric transformation, most structural manipulation, formula-based color change) and model-specific specializations. Fine-grained benchmark diagnostics further show performance degradations induced by scene variations in object count, background complexity, color scheme, and edit-region size. To test generalization of PaintBench scores to applied task performance, we create a procedural, deterministic evaluation for data visualization editing (TinyGrafixBench) and find strong linear correlation with PaintBench scores ($R^2 = 0.91$, $p < 0.001$). Altogether, PaintBench provides a rigorous foundation for measuring and driving progress in precise multimodal visual editing.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PaintBench, a dynamically scalable benchmark for 20 precise visual editing operations in four categories (geometric transformation, structural manipulation, color change, symbolic reasoning). It relies on procedural generation for an effectively infinite, contamination-resistant suite and deterministic pixel-level mIoU evaluation that avoids judge-model bias. Across 11 models the highest score is 17.1% mIoU; task decomposition and scene-variation diagnostics are reported. Generalization is tested via a new procedural TinyGrafixBench for data-visualization editing, yielding R²=0.91 (p<0.001).
Significance. If the central claims hold, PaintBench supplies a reproducible, bias-free, and scalable instrument for measuring progress on precise single-answer editing—an area where current models demonstrably underperform. The procedural generation and deterministic ground-truth evaluation are explicit strengths that remove reliance on subjective judges and enable parameter-free scaling; the reported correlation with TinyGrafixBench provides an internal consistency check between two synthetic benchmarks.
major comments (2)
- [Abstract / TinyGrafixBench section] Abstract and the section describing TinyGrafixBench: the assertion that the R²=0.91 correlation demonstrates generalization 'to applied task performance' is not supported by the evidence presented. Both PaintBench and TinyGrafixBench are procedurally generated with deterministic pixel-level ground truth; neither incorporates ambiguous natural-language instructions nor non-synthetic image distributions typical of real user edits. This leaves the external-validity step untested and weakens the claim that PaintBench scores will drive progress on real-world applications.
- [Evaluation / correlation analysis] Evaluation and correlation sections: the manuscript does not supply the data-exclusion rules, outlier-handling procedure, or error-analysis details used to compute the reported mIoU values and the R²=0.91 correlation. Without these, it is impossible to determine whether post-hoc choices affect the central performance rankings or the linear-fit result.
minor comments (1)
- [Benchmark definition] A table enumerating all 20 operations with their exact procedural parameters and complexity controls would improve reproducibility and allow readers to assess coverage.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments below and will make revisions to improve clarity and completeness.
- Referee: [Abstract / TinyGrafixBench section] Abstract and the section describing TinyGrafixBench: the assertion that the R²=0.91 correlation demonstrates generalization 'to applied task performance' is not supported by the evidence presented. Both PaintBench and TinyGrafixBench are procedurally generated with deterministic pixel-level ground truth; neither incorporates ambiguous natural-language instructions nor non-synthetic image distributions typical of real user edits. This leaves the external-validity step untested and weakens the claim that PaintBench scores will drive progress on real-world applications.
  Authors: We agree that the original phrasing 'to applied task performance' overstates the scope of the evidence. Both benchmarks rely on procedural generation and deterministic evaluation, so the correlation shows consistency across synthetic editing tasks rather than generalization to real-world edits with natural images or ambiguous instructions. We will revise the abstract and TinyGrafixBench section to state that the correlation indicates generalization to other procedural editing domains, removing any implication of real-world applicability.
  Revision: yes
- Referee: [Evaluation / correlation analysis] Evaluation and correlation sections: the manuscript does not supply the data-exclusion rules, outlier-handling procedure, or error-analysis details used to compute the reported mIoU values and the R²=0.91 correlation. Without these, it is impossible to determine whether post-hoc choices affect the central performance rankings or the linear-fit result.
  Authors: The manuscript indeed omits these procedural details. In the revision we will add an explicit subsection under Evaluation describing the computation pipeline: all procedurally generated samples were retained with no exclusions or outlier removal; mIoU was averaged per model across all tasks and scenes; the R² value was obtained via ordinary least-squares regression on the 11 model-level mean mIoU scores; and we will report the exact sample counts and any sensitivity checks performed.
  Revision: yes
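The regression step described in the second response can be sketched in a few lines of Python. This is a minimal illustration assuming one mean score per model on each benchmark; the function name `ols_r_squared` and its argument layout are hypothetical, and no values from the paper are reproduced here.

```python
import numpy as np

def ols_r_squared(x, y):
    """Fit y = slope * x + intercept by ordinary least squares.

    x: per-model mean scores on one benchmark (e.g. PaintBench mIoU)
    y: per-model mean scores on the other benchmark
    Returns (slope, intercept, r_squared).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # A degree-1 polynomial fit is the closed-form OLS line.
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    ss_res = float(np.sum(residuals ** 2))       # unexplained variation
    ss_tot = float(np.sum((y - y.mean()) ** 2))  # total variation
    return float(slope), float(intercept), 1.0 - ss_res / ss_tot
```

With a single predictor, R² equals the squared Pearson correlation, so with only 11 points the fit can be sensitive to one or two models; this is why the referee's request for sensitivity checks bears directly on the reported value.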
Circularity Check
No significant circularity; benchmark definition and validation are independent
The paper constructs PaintBench via procedural generation of 20 operations with deterministic pixel-level mIoU ground truth, then reports model scores and a correlation (R²=0.91) to a separately defined TinyGrafixBench. Neither the benchmark definition nor the reported correlation reduces to a self-referential fit, self-citation chain, or imported uniqueness theorem; both benchmarks are constructed independently of model outputs and the correlation is presented as an external check rather than a tautological prediction. No load-bearing steps match the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Procedural generation with configurable complexity produces contamination-resistant test cases.
- domain assumption: Pixel-level mIoU is an appropriate and unbiased measure of precise editing success.
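The first assumption rests on test cases being regenerated from a seed rather than drawn from a fixed, publishable set. A minimal Python sketch of that idea follows; the `Rect` type, the `generate_scene` and `translate` helpers, and the `num_objects` and `size` parameters are hypothetical stand-ins for the benchmark's complexity controls, which this page does not describe.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Rect:
    x: int
    y: int
    w: int
    h: int
    color: tuple

def generate_scene(seed, num_objects=3, size=64):
    """Build a scene of random rectangles that is fully determined by `seed`."""
    rng = random.Random(seed)  # private generator, no global state
    rects = []
    for _ in range(num_objects):
        w = rng.randint(4, size // 4)
        h = rng.randint(4, size // 4)
        x = rng.randint(0, size - w)  # keep the rectangle inside the canvas
        y = rng.randint(0, size - h)
        color = (rng.randrange(256), rng.randrange(256), rng.randrange(256))
        rects.append(Rect(x, y, w, h, color))
    return rects

def translate(rect, dx, dy):
    """Ground-truth result of a 'move by (dx, dy)' edit on one object."""
    return Rect(rect.x + dx, rect.y + dy, rect.w, rect.h, rect.color)
```

Since both the scene and the edited target come from the same rule, a fresh seed yields a fresh test case with an exact expected answer. Whether such scenes resist contamination in practice still depends on how far a model can learn the generation rule itself, which is what the assumption leaves open.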