PaintBench: Deterministic Evaluation of Precise Visual Editing

Ellis Brown; He He; Kai Xu; Rob Fergus; Saining Xie; Shrikar Madhu

arXiv: 2606.00188 · v1 · pith:Z6D3MTFMnew · submitted 2026-05-29 · 💻 cs.GR · cs.CV · cs.LG

Kai Xu, Ellis Brown, Shrikar Madhu, Rob Fergus, He He, Saining Xie

Pith reviewed 2026-06-28 19:18 UTC · model grok-4.3

classification 💻 cs.GR · cs.CV · cs.LG

keywords PaintBench · precise visual editing · multimodal models · benchmark evaluation · procedural generation · mIoU · image manipulation · deterministic evaluation

The pith

The best of 11 evaluated image editing models scores only 17.1% mIoU on PaintBench's precise visual editing tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

PaintBench is a benchmark for precise, single-answer visual editing that covers 20 operations in four categories: geometric transformation, structural manipulation, color change, and symbolic reasoning. Procedural generation yields scalable, contamination-resistant tests that are scored deterministically by pixel-level mIoU. An evaluation of 11 models finds low performance overall, with particular difficulty on geometric transformations and structural manipulations. Scores also correlate strongly with performance on a procedurally generated data visualization editing task (TinyGrafixBench, R² = 0.91), which suggests the benchmark measures editing ability that transfers across tasks. The benchmark thereby offers a way to measure and drive progress on edits that have a single correct answer, which open-ended editing evaluations do not capture.

Core claim

PaintBench targets 20 fundamental precise visual editing operations across geometric transformation, structural manipulation, color change, and symbolic reasoning. Its procedural generation with configurable complexity produces an effectively infinite test suite, and deterministic pixel-level mIoU evaluation removes dependence on judge models. Across 11 image editing models the highest score is 17.1 percent mIoU, with geometric transformations, most structural manipulations, and formula-based color changes proving especially difficult; fine-grained diagnostics link performance drops to object count, background complexity, color scheme, and edit-region size. Scores on PaintBench also show str

What carries the argument

PaintBench benchmark of 20 procedurally generated precise visual editing operations evaluated by deterministic pixel-level mean intersection over union.
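
To make the metric concrete, here is a minimal sketch of a pixel-level mean IoU. It assumes that the model output and the ground truth are compared as integer label maps (one label per pixel) and that IoU is averaged over every label present in either image. The paper's exact definition is not given on this page, so the name `pixel_miou` and the label-map representation are illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, List

# Hypothetical representation: a 2D grid with one integer label per pixel.
Grid = List[List[int]]


def pixel_miou(pred: Grid, target: Grid) -> float:
    """Mean intersection-over-union across labels present in pred or target."""
    if len(pred) != len(target) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, target)
    ):
        raise ValueError("pred and target must have the same shape")

    intersection: Dict[int, int] = {}
    union: Dict[int, int] = {}
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == t:
                # The pixel agrees: it counts once toward both sets for label p.
                intersection[p] = intersection.get(p, 0) + 1
                union[p] = union.get(p, 0) + 1
            else:
                # The pixel disagrees: it enlarges the union of both labels.
                union[p] = union.get(p, 0) + 1
                union[t] = union.get(t, 0) + 1

    if not union:
        return 1.0  # two empty images are trivially identical
    return sum(intersection.get(k, 0) / union[k] for k in union) / len(union)
```

Because the score depends only on the two pixel grids, repeated evaluations of the same output give the same number, which is the property that removes the need for a judge model.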

If this is right

Targeted gains on geometric transformations and structural manipulations would be required to raise overall model performance on precise edits.
Robustness to changes in object count, background complexity, and edit-region size would need to improve for consistent results across scenes.
The observed linear correlation with applied editing performance implies PaintBench scores can forecast success on related tasks without separate testing.
Replacement of judge-model evaluation with pixel mIoU removes one source of bias in measuring editing accuracy.
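
The scene-variation diagnostics amount to grouping per-sample scores by a scene attribute and comparing group means. The sketch below shows that grouping under assumed names: the record fields `object_count` and `miou` and the function `mean_score_by_factor` are hypothetical, and the numbers in the usage lines are made-up toy values rather than results from the paper.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Hashable, List


def mean_score_by_factor(records: List[dict], factor: str) -> Dict[Hashable, float]:
    """Average the 'miou' field of each record, grouped by records[i][factor]."""
    groups = defaultdict(list)
    for record in records:
        groups[record[factor]].append(record["miou"])
    # Sort by factor level so that trends are easy to read off.
    return {level: mean(scores) for level, scores in sorted(groups.items())}


# Toy data for illustration only (not taken from the paper).
toy_records = [
    {"object_count": 1, "miou": 0.5},
    {"object_count": 1, "miou": 0.25},
    {"object_count": 4, "miou": 0.125},
]
by_count = mean_score_by_factor(toy_records, "object_count")
```

A falling group mean as the factor level rises would be the kind of degradation the diagnostics are meant to expose.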

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Model developers could incorporate PaintBench scores to prioritize training or fine-tuning on the hardest operation categories.
The benchmark's modular design allows straightforward addition of new operation types or complexity levels to cover emerging editing needs.
Model-specific performance patterns suggest that combining outputs from multiple specialized models could raise aggregate accuracy on mixed tasks.
Widespread adoption might shift evaluation standards in the field toward deterministic, pixel-exact metrics for editing rather than open-ended generation.

Load-bearing premise

The 20 selected operations and their procedural generation rules form a representative proxy for all precise visual editing needs.

What would settle it

A model achieving high PaintBench scores yet low accuracy on a broad collection of real user-specified precise edits outside the procedural generation rules would disprove its value as a general proxy.

Original abstract

While current multimodal models are proficient at open-ended visual editing, executing precise single-answer edits remains an important obstacle. To probe this challenge, we introduce PaintBench, a dynamically scalable benchmark targeting 20 fundamental precise visual editing operations across four categories: geometric transformation, structural manipulation, color change, and symbolic reasoning. Procedural generation with configurable complexity enables an effectively infinite, contamination-resistant evaluation suite, and deterministic pixel-level evaluation eliminates reliance on bias-prone judge models. Across 11 image editing models, we find overall low performance, with the current highest-performing industry leader scoring only 17.1% (mIoU). Task decomposition reveals especially challenging operation types (geometric transformation, most structural manipulation, formula-based color change) and model-specific specializations. Fine-grained benchmark diagnostics further show performance degradations induced by scene variations in object count, background complexity, color scheme, and edit-region size. To test generalization of PaintBench scores to applied task performance, we create a procedural, deterministic evaluation for data visualization editing (TinyGrafixBench) and find strong linear correlation with PaintBench scores ($R^2 = 0.91$, $p < 0.001$). Altogether, PaintBench provides a rigorous foundation for measuring and driving progress in precise multimodal visual editing.
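
As an illustration of how procedural generation can give every test case exactly one correct answer, the following sketch builds a small scene from a seed, issues a recolor instruction, and derives the ground truth by applying the edit in code. Everything here is an assumption made for illustration: the function `make_case`, the rectangle scenes, the integer labels standing in for colors, and the `n_objects` complexity knob are hypothetical and are not the paper's generator.

```python
import random
from typing import List, Tuple

Grid = List[List[int]]  # hypothetical: integer labels stand in for colors


def make_case(seed: int, size: int = 8, n_objects: int = 3) -> Tuple[Grid, str, Grid]:
    """Return (source image, instruction, ground-truth image) for one test case.

    Assumes size >= 3 so that every rectangle fits inside the grid.
    """
    rng = random.Random(seed)  # a private, seeded generator: same seed, same case
    source = [[0] * size for _ in range(size)]  # label 0 is the background

    # Draw n_objects rectangles; later rectangles may cover earlier ones.
    for label in range(1, n_objects + 1):
        width, height = rng.randint(1, 3), rng.randint(1, 3)
        x = rng.randrange(size - width + 1)
        y = rng.randrange(size - height + 1)
        for row in range(y, y + height):
            for col in range(x, x + width):
                source[row][col] = label

    # The last rectangle is never covered, so its label is always visible.
    old, new = n_objects, n_objects + 1
    instruction = f"Recolor every pixel with label {old} to label {new}."

    # The ground truth is computed, not annotated, so it is exact by construction.
    target = [[new if value == old else value for value in row] for row in source]
    return source, instruction, target
```

Fresh seeds yield fresh cases on demand, which is one way an evaluation suite can resist contamination, and raising `n_objects` is a simple stand-in for configurable complexity.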

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

PaintBench is a clean deterministic benchmark for precise edits with useful diagnostics, but its external-validity claim rests only on correlation to another procedural synthetic task.

The paper introduces PaintBench, a procedurally generated benchmark covering 20 exact editing operations across geometric, structural, color, and symbolic categories. It evaluates 11 models with pixel-level mIoU and reports low overall scores, peaking at 17.1% for the best model, plus breakdowns showing geometric transforms and certain structural changes as hardest. The procedural setup and deterministic metrics are the real contribution: they give an effectively unlimited, contamination-resistant test suite and let the authors measure degradation from factors like object count or edit-region size without relying on judge models.

The correlation to TinyGrafixBench (R²=0.91) is presented as evidence that PaintBench scores track applied performance, but both benchmarks are synthetic and deterministic. That leaves the link to actual user edits with ambiguous instructions untested, which is the main soft spot. The rest of the work looks solid on its own terms: the task decomposition and scene-variation diagnostics are straightforward and reproducible.

The paper merits a serious referee and is worth reading for groups building or evaluating precise editing models. The benchmark itself is a practical tool even if the generalization story needs more external checks. I would bring it to a reading group for the methods discussion but would not cite the paper in my own work unless I started using the benchmark directly.

Referee Report

2 major / 1 minor

Summary. The paper introduces PaintBench, a dynamically scalable benchmark for 20 precise visual editing operations in four categories (geometric transformation, structural manipulation, color change, symbolic reasoning). It relies on procedural generation for an effectively infinite, contamination-resistant suite and deterministic pixel-level mIoU evaluation that avoids judge-model bias. Across 11 models the highest score is 17.1% mIoU; task decomposition and scene-variation diagnostics are reported. Generalization is tested via a new procedural TinyGrafixBench for data-visualization editing, yielding R²=0.91 (p<0.001).

Significance. If the central claims hold, PaintBench supplies a reproducible, bias-free, and scalable instrument for measuring progress on precise single-answer editing—an area where current models demonstrably underperform. The procedural generation and deterministic ground-truth evaluation are explicit strengths that remove reliance on subjective judges and enable parameter-free scaling; the reported correlation with TinyGrafixBench provides an internal consistency check between two synthetic benchmarks.

major comments (2)

[Abstract / TinyGrafixBench section] Abstract and the section describing TinyGrafixBench: the assertion that the R²=0.91 correlation demonstrates generalization 'to applied task performance' is not supported by the evidence presented. Both PaintBench and TinyGrafixBench are procedurally generated with deterministic pixel-level ground truth; neither incorporates ambiguous natural-language instructions nor non-synthetic image distributions typical of real user edits. This leaves the external-validity step untested and weakens the claim that PaintBench scores will drive progress on real-world applications.
[Evaluation / correlation analysis] Evaluation and correlation sections: the manuscript does not supply the data-exclusion rules, outlier-handling procedure, or error-analysis details used to compute the reported mIoU values and the R²=0.91 correlation. Without these, it is impossible to determine whether post-hoc choices affect the central performance rankings or the linear-fit result.

minor comments (1)

[Benchmark definition] A table enumerating all 20 operations with their exact procedural parameters and complexity controls would improve reproducibility and allow readers to assess coverage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will make revisions to improve clarity and completeness.

Referee: [Abstract / TinyGrafixBench section] Abstract and the section describing TinyGrafixBench: the assertion that the R²=0.91 correlation demonstrates generalization 'to applied task performance' is not supported by the evidence presented. Both PaintBench and TinyGrafixBench are procedurally generated with deterministic pixel-level ground truth; neither incorporates ambiguous natural-language instructions nor non-synthetic image distributions typical of real user edits. This leaves the external-validity step untested and weakens the claim that PaintBench scores will drive progress on real-world applications.

Authors: We agree that the original phrasing 'to applied task performance' overstates the scope of the evidence. Both benchmarks rely on procedural generation and deterministic evaluation, so the correlation shows consistency across synthetic editing tasks rather than generalization to real-world edits with natural images or ambiguous instructions. We will revise the abstract and TinyGrafixBench section to state that the correlation indicates generalization to other procedural editing domains, removing any implication of real-world applicability.
Revision: yes
Referee: [Evaluation / correlation analysis] Evaluation and correlation sections: the manuscript does not supply the data-exclusion rules, outlier-handling procedure, or error-analysis details used to compute the reported mIoU values and the R²=0.91 correlation. Without these, it is impossible to determine whether post-hoc choices affect the central performance rankings or the linear-fit result.

Authors: The manuscript indeed omits these procedural details. In the revision we will add an explicit subsection under Evaluation describing the computation pipeline: all procedurally generated samples were retained with no exclusions or outlier removal; mIoU was averaged per model across all tasks and scenes; the R² value was obtained via ordinary least-squares regression on the 11 model-level mean mIoU scores; and we will report the exact sample counts and any sensitivity checks performed.
Revision: yes
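
The computation described in this response (ordinary least squares on one mean score per model, followed by R²) can be sketched in a few lines. The function name `ols_r_squared` is hypothetical, and the input lists are placeholders for the per-model mean scores on the two benchmarks; no values from the paper are reproduced here.

```python
from statistics import mean
from typing import Sequence, Tuple


def ols_r_squared(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Fit y = intercept + slope * x by least squares; return (slope, intercept, R^2)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")

    x_bar, y_bar = mean(x), mean(y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    if s_xx == 0 or ss_tot == 0:
        raise ValueError("x and y must each contain at least two distinct values")

    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar

    # R^2 = 1 - (residual sum of squares) / (total sum of squares).
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return slope, intercept, 1.0 - ss_res / ss_tot
```

With only 11 points, one per model, the fit can be sensitive to a single model, which is why the referee's request for sample counts and sensitivity checks matters.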

Circularity Check

0 steps flagged

No significant circularity; benchmark definition and validation are independent

The paper constructs PaintBench via procedural generation of 20 operations with deterministic pixel-level mIoU ground truth, then reports model scores and a correlation (R²=0.91) to a separately defined TinyGrafixBench. Neither the benchmark definition nor the reported correlation reduces to a self-referential fit, self-citation chain, or imported uniqueness theorem; both benchmarks are constructed independently of model outputs and the correlation is presented as an external check rather than a tautological prediction. No load-bearing steps match the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that the chosen 20 operations and mIoU metric capture the essence of precise editing without introducing selection bias; no free parameters, invented entities, or non-standard axioms are invoked in the abstract.

axioms (2)

domain assumption: Procedural generation with configurable complexity produces contamination-resistant test cases. Invoked to justify the infinite suite and lack of data leakage concerns.
domain assumption: Pixel-level mIoU is an appropriate and unbiased measure of precise editing success. Used to eliminate reliance on judge models.


[1] [1]

InstructPix2Pix: Learning to Follow Image Editing Instructions

Tim Brooks, Aleksander Holynski, and Alexei A Efros. InstructPix2Pix: Learning to Follow Image Editing Instructions. InCVPR, 2023

2023

[2] [2]

FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space.arXiv preprint arXiv:2506.15742, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

ChatGPT Images 2.0 System Card

OpenAI. ChatGPT Images 2.0 System Card. https://deploymentsafety.openai.com/chat gpt-images-2-0/introduction, 2026

2026

[4] [4]

Nano Banana 2 (Gemini 3.1 Flash Image)

Google. Nano Banana 2 (Gemini 3.1 Flash Image). https://deepmind.google/models/gemin i-image/flash/, 2026

2026

[5] [5]

MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing

Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing. InNeurIPS, 2023

2023

[6] [6]

EditVal: Benchmarking Diffusion Based Text-Guided Image Editing Methods.arXiv preprint arXiv:2310.02426, 2023

Samyadeep Basu, Mehrdad Saberi, Shweta Bhardwaj, Atoosa Malemir Chegini, Daniela Massiceti, Maziar Sanjabi, Shell Xu Hu, and Soheil Feizi. EditVal: Benchmarking Diffusion Based Text-Guided Image Editing Methods.arXiv preprint arXiv:2310.02426, 2023

work page arXiv 2023

[7] [7]

I2EBench: A Comprehensive Benchmark for Instruction-Based Image Editing

Yiwei Ma, Jiayi Ji, Ke Ye, Weihuang Lin, Zhibin Wang, Yonghan Zheng, Qiang Zhou, Xiaoshuai Sun, and Rongrong Ji. I2EBench: A Comprehensive Benchmark for Instruction-Based Image Editing. InNeurIPS, 2024

2024

[8] [8]

Image Quality Assessment: From Error Visibility to Structural Similarity.IEEE TIP, 2004

Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image Quality Assessment: From Error Visibility to Structural Similarity.IEEE TIP, 2004

2004

[9] [9]

The Unreason- able Effectiveness of Deep Features as a Perceptual Metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The Unreason- able Effectiveness of Deep Features as a Perceptual Metric. InCVPR, 2018

2018

[10] [10]

GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In NeurIPS, 2017

2017

[11] [11]

UniREditBench: A Unified Reasoning-based Image Editing Benchmark.arXiv preprint arXiv:2511.01295, 2025

Feng Han, Yibin Wang, Chenglin Li, Zheming Liang, Dianyi Wang, Yang Jiao, Zhipeng Wei, Chao Gong, Cheng Jin, Jingjing Chen, et al. UniREditBench: A Unified Reasoning-based Image Editing Benchmark.arXiv preprint arXiv:2511.01295, 2025

work page arXiv 2025

[12] [12]

Complex- Edit: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark

Siwei Yang, Mude Hui, Bingchen Zhao, Yuyin Zhou, Nataniel Ruiz, and Cihang Xie. Complex- Edit: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark. TMLR, 2026

2026

[13] [13]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating Large Language Models Trained on Code.arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[14] [14]

Emu3: Next-Token Prediction is All You Need

Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-Token Prediction is All You Need. arXiv preprint arXiv:2409.18869, 2024. 18 PA I N TBE N C H: Deterministic Evaluation of Precise Visual Editing

work page internal anchor Pith review Pith/arXiv arXiv 2024

[15]

Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai C Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs. In NeurIPS, 2024.

[16]

Beyond Language Modeling: An Exploration of Multimodal Pretraining

Shengbang Tong, David Fan, John Nguyen, Ellis Brown, Gaoyue Zhou, Shengyi Qian, Boyang Zheng, Théophane Vallaeys, Junlin Han, Rob Fergus, Naila Murray, Marjan Ghazvininejad, Mike Lewis, Nicolas Ballas, Amir Bar, Michael Rabbat, Jakob Verbeek, Luke Zettlemoyer, Koustuv Sinha, Yann LeCun, and Saining Xie. Beyond Language Modeling: An Exploration of Multimod...

2026

[17]

Shanshan Zhao, Xinjie Zhang, Jintao Guo, Jiakui Hu, Lunhao Duan, Minghao Fu, Yong Xien Chng, Guo-Hua Wang, Qing-Guo Chen, Zhao Xu, et al. Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities. arXiv preprint arXiv:2505.02567, 2025.

[18]

Valentin Gabeur, Shangbang Long, Songyou Peng, Paul Voigtlaender, Shuyang Sun, Yanan Bao, Karen Truong, Zhicheng Wang, Wenlei Zhou, Jonathan T Barron, et al. Image Generators are Generalist Vision Learners. arXiv preprint arXiv:2604.20329, 2026.

[19]

Yang Ye, Xianyi He, Zongjian Li, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, Li Yuan, et al. ImgEdit: A Unified Image Editing Dataset and Benchmark. In NeurIPS, 2026.

[20]

Runzhou Liu, Hailey Weingord, Sejal Mittal, Prakhar Dungarwal, Anusha Nandula, Bo Ni, Samyadeep Basu, Hongjie Chen, Nesreen K Ahmed, Li Li, et al. Human-Aligned MLLM Judges for Fine-Grained Image Editing Evaluation: A Benchmark, Framework, and Analysis. arXiv preprint arXiv:2602.13028, 2026.

[21]

Divake Kumar, Sina Tayebati, Devashri Naik, Ranganath Krishnan, and Amit Ranjan Trivedi. VLM Judges Can Rank but Cannot Score: Task-Dependent Uncertainty in Multimodal Evaluation. arXiv preprint arXiv:2604.25235, 2026.

[22]

Xiangyu Zhao, Peiyuan Zhang, Kexian Tang, Xiaorong Zhu, Hao Li, Wenhao Chai, Zicheng Zhang, Renqiu Xia, Guangtao Zhai, Junchi Yan, et al. Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing. In NeurIPS, 2026.

[23]

Yongliang Wu, Zonghui Li, Xinting Hu, Xinyu Ye, Xianfang Zeng, Gang Yu, Wenbo Zhu, Bernt Schiele, Ming-Hsuan Yang, and Xu Yang. KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models. In NeurIPS, 2026.

[24]

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment. In NeurIPS, 2023.

[25]

Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2I-CompBench: A Comprehensive Benchmark for Open-World Compositional Text-to-Image Generation. In NeurIPS, 2023.

[26]

Amita Kamath, Kai-Wei Chang, Ranjay Krishna, Luke Zettlemoyer, Yushi Hu, and Marjan Ghazvininejad. GenEval 2: Addressing Benchmark Drift in Text-to-Image Evaluation. arXiv preprint arXiv:2512.16853, 2025.

[27]

Maijunxian Wang, Ruisi Wang, Juyi Lin, Ran Ji, Thaddäus Wiedemer, Qingying Gao, Dezhi Luo, Yaoyao Qian, Lianyu Huang, Zelong Hong, et al. A Very Big Video Reasoning Suite. arXiv preprint arXiv:2602.20159, 2026.

[28]

Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. In CVPR, 2017.

[29]

François Chollet. On the Measure of Intelligence. arXiv preprint arXiv:1911.01547, 2019.

[30]

Chi Zhang, Feng Gao, Baoxiong Jia, Yixin Zhu, and Song-Chun Zhu. RAVEN: A Dataset for Relational and Analogical Visual Reasoning. In CVPR, 2019.

[31]

Weili Nie, Zhiding Yu, Lei Mao, Ankit B Patel, Yuke Zhu, and Anima Anandkumar. Bongard-LOGO: A New Benchmark for Human-Level Concept Learning and Reasoning. In NeurIPS, 2020.

[32]

Xinxin Liu, Zhaopan Xu, Ming Li, Kai Wang, Yong Jae Lee, and Yuzhang Shang. Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark. arXiv preprint arXiv:2511.13853, 2025.

[33]

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-Image Technical Report. arXiv preprint arXiv:2508.02324, 2025.

[34]

Meituan LongCat Team, Hanghang Ma, Haoxian Tan, Jiale Huang, Junqiang Wu, Jun-Yan He, Lishuai Gao, Songlin Xiao, Xiaoming Wei, Xiaoqi Ma, et al. LongCat-Image Technical Report. arXiv preprint arXiv:2512.07584, 2025.

[35]

Black Forest Labs. FLUX.2: Frontier Visual Intelligence. https://bfl.ai/blog/flux-2, 2025.

[36]

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging Properties in Unified Multimodal Pretraining. arXiv preprint arXiv:2505.14683, 2025.

[37]

Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, et al. HunyuanImage 3.0 Technical Report. arXiv preprint arXiv:2509.23951, 2025.

[38]

Google. Introducing Gemini 2.5 Flash Image. https://developers.googleblog.com/en/introducing-gemini-2-5-flash-image/, 2025.

[39]

Chameleon Team. Chameleon: Mixed-Modal Early-Fusion Foundation Models. arXiv preprint arXiv:2405.09818, 2024.

[40]

Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model. In ICLR, 2025.

[41]

Shengbang Tong, David Fan, Jiachen Zhu, Yunyang Xiong, Xinlei Chen, Koustuv Sinha, Michael Rabbat, Yann LeCun, Saining Xie, and Zhuang Liu. MetaMorph: Multimodal Understanding and Generation via Instruction Tuning. In ICCV, 2025.

[42]

Alan R Robertson. The CIE 1976 Color-Difference Formulae. Color Research & Application, 1977.

[43]

Samuel A Minaker, Ryan H Mason, and David R Chow. Optimizing Color Performance of the Ngenuity 3-Dimensional Visualization System. Ophthalmology Science, 2021.

[44]

Zachary Schuessler. Delta E 101. https://zschuessler.github.io/DeltaE/learn/, 2016.

[45]

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common Objects in Context. In ECCV, 2014.

[46]

Bradley Efron and Robert J Tibshirani. An Introduction to the Bootstrap. Chapman and Hall/CRC, 1994.

Appendix

This appendix provides benchmark construction details, experimental details, qualitative examples, and extended results supporting the main paper:

• (§A) Benchmark Construction:...