arxiv: 2606.00188 · v1 · pith:Z6D3MTFMnew · submitted 2026-05-29 · 💻 cs.GR · cs.CV · cs.LG

PaintBench: Deterministic Evaluation of Precise Visual Editing

Pith reviewed 2026-06-28 19:18 UTC · model grok-4.3

classification 💻 cs.GR · cs.CV · cs.LG
keywords PaintBench · precise visual editing · multimodal models · benchmark evaluation · procedural generation · mIoU · image manipulation · deterministic evaluation

The pith

Current multimodal models score at most 17.1 percent on precise visual editing tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

PaintBench is introduced as a benchmark for precise, single-answer visual editing using 20 operations in geometric, structural, color, and symbolic categories. Procedural generation creates scalable, contamination-free tests evaluated deterministically by pixel mIoU. Evaluation of 11 models reveals low performance overall and specific difficulties with geometric transformations and structural changes. Scores on this benchmark correlate strongly with performance on a data visualization editing task, suggesting it measures transferable editing ability. This provides a way to drive progress where open-ended editing is insufficient.

Core claim

PaintBench targets 20 fundamental precise visual editing operations across geometric transformation, structural manipulation, color change, and symbolic reasoning. Its procedural generation with configurable complexity produces an effectively infinite test suite, and deterministic pixel-level mIoU evaluation removes dependence on judge models. Across 11 image editing models the highest score is 17.1 percent mIoU, with geometric transformations, most structural manipulations, and formula-based color changes proving especially difficult; fine-grained diagnostics link performance drops to object count, background complexity, color scheme, and edit-region size. Scores on PaintBench also show strong linear correlation with scores on TinyGrafixBench, a procedural data visualization editing evaluation ($R^2 = 0.91$, $p < 0.001$).

What carries the argument

PaintBench benchmark of 20 procedurally generated precise visual editing operations evaluated by deterministic pixel-level mean intersection over union.
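As a rough illustration of what a deterministic pixel-level mIoU can look like, the sketch below compares a predicted label map against a ground-truth label map and averages the per-label IoU. It is a minimal sketch under stated assumptions, not the paper's metric: the page does not say how PaintBench maps pixels to classes, so the integer label grids and the helper name `pixel_miou` are hypothetical.

```python
from typing import Dict, Sequence

Grid = Sequence[Sequence[int]]  # assumption: one integer label per pixel


def pixel_miou(pred: Grid, target: Grid) -> float:
    """Mean IoU over every label that appears in either grid."""
    if len(pred) != len(target) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, target)
    ):
        raise ValueError("pred and target must have the same shape")

    inter: Dict[int, int] = {}
    union: Dict[int, int] = {}
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            if p == t:
                # Pixel agrees: counts toward intersection and union of that label.
                inter[p] = inter.get(p, 0) + 1
                union[p] = union.get(p, 0) + 1
            else:
                # Pixel disagrees: counts toward the union of both labels.
                union[p] = union.get(p, 0) + 1
                union[t] = union.get(t, 0) + 1

    if not union:
        raise ValueError("grids must contain at least one pixel")
    return sum(inter.get(label, 0) / union[label] for label in union) / len(union)


# Toy case: a 2x2 square of label 1 on a 4x4 background of label 0,
# with the prediction shifted one pixel to the right.
target = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
pred = [[0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 0]]
score = pixel_miou(pred, target)
```

In the toy case a one-pixel shift already costs roughly half of the score, which shows how unforgiving an exact pixel metric is for geometric edits. Because the function has no learned component, the same pair of grids always yields the same score.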

If this is right

  • Targeted gains on geometric transformations and structural manipulations would be required to raise overall model performance on precise edits.
  • Robustness to changes in object count, background complexity, and edit-region size would need to improve for consistent results across scenes.
  • The observed linear correlation with applied editing performance implies PaintBench scores can forecast success on related tasks without separate testing.
  • Replacement of judge-model evaluation with pixel mIoU removes one source of bias in measuring editing accuracy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Model developers could incorporate PaintBench scores to prioritize training or fine-tuning on the hardest operation categories.
  • The benchmark's modular design allows straightforward addition of new operation types or complexity levels to cover emerging editing needs.
  • Model-specific performance patterns suggest that combining outputs from multiple specialized models could raise aggregate accuracy on mixed tasks.
  • Widespread adoption might shift evaluation standards in the field toward deterministic, pixel-exact metrics for editing rather than open-ended generation.

Load-bearing premise

The 20 selected operations and their procedural generation rules form a representative proxy for all precise visual editing needs.

What would settle it

A model achieving high PaintBench scores yet low accuracy on a broad collection of real user-specified precise edits outside the procedural generation rules would disprove its value as a general proxy.

Original abstract

While current multimodal models are proficient at open-ended visual editing, executing precise single-answer edits remains an important obstacle. To probe this challenge, we introduce PaintBench, a dynamically scalable benchmark targeting 20 fundamental precise visual editing operations across four categories: geometric transformation, structural manipulation, color change, and symbolic reasoning. Procedural generation with configurable complexity enables an effectively infinite, contamination-resistant evaluation suite, and deterministic pixel-level evaluation eliminates reliance on bias-prone judge models. Across 11 image editing models, we find overall low performance, with the current highest-performing industry leader scoring only 17.1% (mIoU). Task decomposition reveals especially challenging operation types (geometric transformation, most structural manipulation, formula-based color change) and model-specific specializations. Fine-grained benchmark diagnostics further show performance degradations induced by scene variations in object count, background complexity, color scheme, and edit-region size. To test generalization of PaintBench scores to applied task performance, we create a procedural, deterministic evaluation for data visualization editing (TinyGrafixBench) and find strong linear correlation with PaintBench scores ($R^2 = 0.91$, $p < 0.001$). Altogether, PaintBench provides a rigorous foundation for measuring and driving progress in precise multimodal visual editing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces PaintBench, a dynamically scalable benchmark for 20 precise visual editing operations in four categories (geometric transformation, structural manipulation, color change, symbolic reasoning). It relies on procedural generation for an effectively infinite, contamination-resistant suite and deterministic pixel-level mIoU evaluation that avoids judge-model bias. Across 11 models the highest score is 17.1% mIoU; task decomposition and scene-variation diagnostics are reported. Generalization is tested via a new procedural TinyGrafixBench for data-visualization editing, yielding R²=0.91 (p<0.001).

Significance. If the central claims hold, PaintBench supplies a reproducible, bias-free, and scalable instrument for measuring progress on precise single-answer editing—an area where current models demonstrably underperform. The procedural generation and deterministic ground-truth evaluation are explicit strengths that remove reliance on subjective judges and enable parameter-free scaling; the reported correlation with TinyGrafixBench provides an internal consistency check between two synthetic benchmarks.

major comments (2)
  1. [Abstract / TinyGrafixBench section] Abstract and the section describing TinyGrafixBench: the assertion that the R²=0.91 correlation demonstrates generalization 'to applied task performance' is not supported by the evidence presented. Both PaintBench and TinyGrafixBench are procedurally generated with deterministic pixel-level ground truth; neither incorporates ambiguous natural-language instructions nor non-synthetic image distributions typical of real user edits. This leaves the external-validity step untested and weakens the claim that PaintBench scores will drive progress on real-world applications.
  2. [Evaluation / correlation analysis] Evaluation and correlation sections: the manuscript does not supply the data-exclusion rules, outlier-handling procedure, or error-analysis details used to compute the reported mIoU values and the R²=0.91 correlation. Without these, it is impossible to determine whether post-hoc choices affect the central performance rankings or the linear-fit result.
minor comments (1)
  1. [Benchmark definition] A table enumerating all 20 operations with their exact procedural parameters and complexity controls would improve reproducibility and allow readers to assess coverage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will make revisions to improve clarity and completeness.

Point-by-point responses
  1. Referee: [Abstract / TinyGrafixBench section] Abstract and the section describing TinyGrafixBench: the assertion that the R²=0.91 correlation demonstrates generalization 'to applied task performance' is not supported by the evidence presented. Both PaintBench and TinyGrafixBench are procedurally generated with deterministic pixel-level ground truth; neither incorporates ambiguous natural-language instructions nor non-synthetic image distributions typical of real user edits. This leaves the external-validity step untested and weakens the claim that PaintBench scores will drive progress on real-world applications.

    Authors: We agree that the original phrasing 'to applied task performance' overstates the scope of the evidence. Both benchmarks rely on procedural generation and deterministic evaluation, so the correlation shows consistency across synthetic editing tasks rather than generalization to real-world edits with natural images or ambiguous instructions. We will revise the abstract and TinyGrafixBench section to state that the correlation indicates generalization to other procedural editing domains, removing any implication of real-world applicability.
    revision: yes

  2. Referee: [Evaluation / correlation analysis] Evaluation and correlation sections: the manuscript does not supply the data-exclusion rules, outlier-handling procedure, or error-analysis details used to compute the reported mIoU values and the R²=0.91 correlation. Without these, it is impossible to determine whether post-hoc choices affect the central performance rankings or the linear-fit result.

    Authors: The manuscript indeed omits these procedural details. In the revision we will add an explicit subsection under Evaluation describing the computation pipeline: all procedurally generated samples were retained with no exclusions or outlier removal; mIoU was averaged per model across all tasks and scenes; the R² value was obtained via ordinary least-squares regression on the 11 model-level mean mIoU scores; and we will report the exact sample counts and any sensitivity checks performed.
    revision: yes
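
The regression step described in the simulated response (an ordinary least-squares fit on one mean score per model, summarized by R²) can be sketched as below. This is an illustrative sketch only: `ols_r_squared` is a hypothetical helper, and the input numbers are placeholders invented for the example, not scores from the paper.

```python
from typing import Sequence, Tuple


def ols_r_squared(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Fit y = intercept + slope * x by least squares; return (slope, intercept, R^2)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")

    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary for R^2 to be defined")

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares).
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return slope, intercept, 1.0 - ss_res / syy


# Placeholder per-model means (NOT the paper's data): one pair per model.
bench_a = [1.0, 2.0, 3.0, 4.0]
bench_b = [2.0, 4.0, 6.0, 8.0]
slope, intercept, r2 = ols_r_squared(bench_a, bench_b)
```

With only 11 model-level points, a single influential model can move R² noticeably, which is one reason the referee's request for sensitivity checks matters.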

Circularity Check

0 steps flagged

No significant circularity; benchmark definition and validation are independent

Full rationale

The paper constructs PaintBench via procedural generation of 20 operations with deterministic pixel-level mIoU ground truth, then reports model scores and a correlation (R²=0.91) to a separately defined TinyGrafixBench. Neither the benchmark definition nor the reported correlation reduces to a self-referential fit, self-citation chain, or imported uniqueness theorem; both benchmarks are constructed independently of model outputs and the correlation is presented as an external check rather than a tautological prediction. No load-bearing steps match the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that the chosen 20 operations and mIoU metric capture the essence of precise editing without introducing selection bias; no free parameters, invented entities, or non-standard axioms are invoked in the abstract.

axioms (2)
  • domain assumption Procedural generation with configurable complexity produces contamination-resistant test cases
    Invoked to justify the infinite suite and lack of data leakage concerns
  • domain assumption Pixel-level mIoU is an appropriate and unbiased measure of precise editing success
    Used to eliminate reliance on judge models
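
To make the first axiom concrete, the sketch below shows one way a seeded generator can emit a source scene, an instruction, and an exact ground-truth target for a single recoloring operation. Everything here is an assumption for illustration: the page does not describe PaintBench's generator, and `generate_case`, the label palette, and the rectangle scenes are hypothetical.

```python
import random
from typing import Dict, List

PALETTE = [1, 2, 3]  # hypothetical color labels; 0 is the background


def generate_case(seed: int, size: int = 8, n_objects: int = 2) -> Dict[str, object]:
    """Build one recoloring test case that is fully determined by its seed."""
    if size < 2 or n_objects < 1:
        raise ValueError("need size >= 2 and n_objects >= 1")
    rng = random.Random(seed)  # private RNG so the global random state is untouched

    # Scene: a few axis-aligned rectangles drawn on an empty grid.
    source: List[List[int]] = [[0] * size for _ in range(size)]
    for _ in range(n_objects):
        color = rng.choice(PALETTE)
        w, h = rng.randint(1, size // 2), rng.randint(1, size // 2)
        x0, y0 = rng.randint(0, size - w), rng.randint(0, size - h)
        for y in range(y0, y0 + h):
            for x in range(x0, x0 + w):
                source[y][x] = color

    # Edit: recolor one color that is present to a different palette color.
    present = sorted({v for row in source for v in row if v != 0})
    src_color = rng.choice(present)
    dst_color = rng.choice([c for c in PALETTE if c != src_color])
    target = [[dst_color if v == src_color else v for v in row] for row in source]

    return {
        "instruction": f"Recolor every pixel of color {src_color} to color {dst_color}.",
        "source": source,
        "target": target,
        "src_color": src_color,
        "dst_color": dst_color,
    }


case = generate_case(seed=7)
```

Because the target is computed from the scene rather than annotated, new cases cost nothing to produce, and raising `n_objects` or `size` is one simple way to expose a complexity control of the kind the summary describes.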


