pith. machine review for the scientific record.

arxiv: 2605.04566 · v1 · submitted 2026-05-06 · 💻 cs.CV · cs.CL

Recognition: unknown

Open-Source Image Editing Models Are Zero-Shot Vision Learners

Wei Liu, Jiaxin Lin, Rui Chen


Pith reviewed 2026-05-08 16:39 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords zero-shot learning · image editing · monocular depth estimation · surface normal estimation · semantic segmentation · open-source models · generative models

The pith

Open-source image-editing models exhibit non-trivial zero-shot performance on depth estimation, surface normals, and semantic segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether publicly available image-editing models can handle dense visual prediction tasks they were never explicitly trained on. It evaluates three models on standard benchmarks for monocular depth on NYUv2 and DIODE, surface normals on NYUv2, and semantic segmentation on Cityscapes, all without fine-tuning or task-specific data. The models reach competitive accuracy, with one matching or beating fine-tuned baselines on normals and others leading on depth or segmentation metrics. This matters because it indicates that image-editing pretraining alone can produce general scene understanding usable out of the box. A sympathetic reader would see this as evidence that open generative tools already contain latent vision capabilities.

Core claim

The authors establish that open-source image-editing models possess non-trivial zero-shot visual understanding. FireRed-Image-Edit achieves a mean angular error of 17.69 degrees on NYUv2 surface normals, surpassing the fine-tuned Marigold model at 20.86 degrees and matching the instruction-tuned Vision Banana at 17.78 degrees. LongCat-Image-Edit obtains a delta-1 of 0.822 on NYUv2 depth with affine alignment, while Qwen-Image-Edit leads on DIODE indoor depth and reaches 25.7 mIoU on Cityscapes at the 19-class level. Testing three independently trained editors supports the view that this ability emerges from image-editing pretraining rather than model-specific artifacts.
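
The reported numbers are easier to interpret with the metric definitions in hand. Below is a minimal sketch of the conventional formulas for delta-1, mean angular error, and mIoU; these are the standard definitions, and the paper's released evaluation scripts remain the authoritative implementation.

```python
import numpy as np

def delta1(pred, gt):
    """Share of valid pixels where max(pred/gt, gt/pred) < 1.25."""
    valid = gt > 0
    ratio = np.maximum(pred[valid] / gt[valid], gt[valid] / pred[valid])
    return float((ratio < 1.25).mean())

def mean_angular_error_deg(pred_n, gt_n):
    """Mean angle in degrees between predicted and ground-truth normals."""
    pred_n = pred_n / np.linalg.norm(pred_n, axis=-1, keepdims=True)
    gt_n = gt_n / np.linalg.norm(gt_n, axis=-1, keepdims=True)
    cos = np.clip((pred_n * gt_n).sum(axis=-1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())

def miou(pred, gt, num_classes):
    """Mean intersection-over-union over classes present in the union."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union:
            ious.append(inter / union)
    return float(np.mean(ious))
```

Note that delta-1 counts a pixel as correct only when the prediction is within 25% of ground truth, which is why the affine (scale-and-shift) alignment noted for the depth results matters for editors that output relative rather than metric depth.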

What carries the argument

Direct prompting of image-editing models to output dense predictions for geometry and semantics on tasks outside their training objective.
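
To make that concrete, here is a minimal sketch of what direct prompting could look like with the diffusers library. The pipeline class resolution, model identifier, prompt wording, and keyword arguments are illustrative assumptions, not the paper's released scripts, which define the actual templates.

```python
import torch
from PIL import Image
from diffusers import DiffusionPipeline

# Load one of the evaluated editors from the Hugging Face Hub.
# DiffusionPipeline dispatches to the pipeline class registered in the
# model repository; the model id and dtype here are assumptions.
pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image-Edit", torch_dtype=torch.bfloat16
).to("cuda")

image = Image.open("nyu_sample.png").convert("RGB")

# A format-only instruction in the spirit of the zero-shot protocol:
# it names the output format, not the task's solution.
prompt = "Turn this photo into a grayscale depth map: nearer surfaces darker, farther surfaces brighter."

# Edit pipelines conventionally accept a source image plus a text
# instruction; the exact keyword arguments may differ per model.
result = pipe(prompt=prompt, image=image).images[0]
result.save("pred_depth.png")
```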

If this is right

  • Image-editing pretraining alone teaches enough geometry and semantics for usable zero-shot predictions.
  • These models can serve as immediate baselines for vision tasks without any additional training.
  • Consistent performance across the three editors points to a general property of the editing objective.
  • Public code and scripts enable direct comparison and extension to new tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Editing objectives may force internal modeling of 3D structure and object relations that transfer to prediction tasks.
  • Similar zero-shot transfer could appear in other open generative models trained on reconstruction or manipulation.
  • Practitioners might prototype vision applications by repurposing existing editors instead of training dedicated networks.
  • The work invites checks on whether scale or data diversity in editing pretraining drives the observed understanding.

Load-bearing premise

The models had no exposure to the test images during pretraining, and the prompts used to request depth, normals, or segmentation encode no implicit task-specific knowledge.

What would settle it

Reproducing the evaluation on a fresh dataset confirmed to be absent from all training corpora, and observing whether performance falls to random levels or degenerates into incoherent outputs when the prompts are made fully generic.

Figures

Figures reproduced from arXiv:2605.04566 by Wei Liu, Jiaxin Lin, Rui Chen.

Figure 1: Qualitative depth estimation on NYUv2.
Figure 2: Qualitative surface normal estimation on NYUv2.
Figure 3: Qualitative semantic segmentation on Cityscapes.
Figure 4: Qualitative semantic segmentation on Cityscapes.
Original abstract

Recent studies have shown that large generative models can solve vision tasks they were not explicitly trained for. However, existing evidence relies on closed-source models (Veo 3, Nano Banana Pro) or requires task-specific instruction tuning, leaving open whether publicly available image-editing models possess zero-shot vision abilities out of the box. We conduct a systematic evaluation of three open-source image-editing models -- Qwen-Image-Edit, FireRed-Image-Edit, and LongCat-Image-Edit -- on dense visual prediction tasks without any fine-tuning. We benchmark monocular depth estimation on NYUv2 and DIODE, surface normal estimation on NYUv2, and semantic segmentation on Cityscapes, covering both geometric and semantic scene understanding. Results show that open-source image-editing models exhibit non-trivial zero-shot visual understanding. On NYUv2 surface normals, FireRed-Image-Edit achieves a mean angular error of 17.69°, surpassing the fine-tuned Marigold (20.86°) and matching the instruction-tuned Vision Banana (17.78°) without any task-specific training. On NYUv2 depth estimation, LongCat-Image-Edit obtains δ1=0.822 with affine alignment, and Qwen-Image-Edit leads on DIODE Indoor (δ1=0.868). On Cityscapes semantic segmentation, Qwen-Image-Edit reaches 25.7 mIoU at the 19-class level and 49.5 mIoU at a coarser 7-category level. By comparing three independently trained editors, we test whether zero-shot vision ability is an emergent property of image-editing pretraining rather than a model-specific artifact. Code, evaluation scripts, and all results are publicly released to serve as a reproducible baseline for future work.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper evaluates three open-source text-conditioned image-editing models (Qwen-Image-Edit, FireRed-Image-Edit, LongCat-Image-Edit) on monocular depth estimation (NYUv2, DIODE), surface normal estimation (NYUv2), and semantic segmentation (Cityscapes) without any fine-tuning or task-specific training. It reports that these models exhibit non-trivial zero-shot visual understanding, with FireRed-Image-Edit achieving 17.69° mean angular error on NYUv2 normals (surpassing fine-tuned Marigold at 20.86° and matching instruction-tuned Vision Banana at 17.78°), LongCat-Image-Edit reaching δ1=0.822 on NYUv2 depth (with affine alignment), Qwen-Image-Edit leading on DIODE indoor depth (δ1=0.868), and Qwen-Image-Edit obtaining 25.7 mIoU (19 classes) / 49.5 mIoU (7 categories) on Cityscapes. The authors argue this ability emerges from image-editing pretraining rather than model-specific artifacts, supported by cross-model comparison, and release code and scripts for reproducibility.

Significance. If the zero-shot protocol is confirmed, the result would be significant: it provides the first systematic evidence that publicly available image-editing models possess emergent dense-prediction capabilities competitive with specialized fine-tuned or instruction-tuned systems, without requiring additional training. The multi-model design strengthens the emergence claim, and the public release of evaluation scripts and results establishes a reproducible baseline for future work on generative-model transfer.

major comments (3)
  1. [§4] §4 (Experimental Setup) and abstract: The exact prompt templates and input formatting used to elicit depth maps, normal maps, and segmentation outputs from the text-conditioned editors are not provided. Because the models are instruction-following, these templates may embed task definitions (e.g., “output a depth map as an image”), which would undermine the zero-shot claim; full disclosure is required to verify the central result.
  2. [§4.2] §4.2 and results tables: No analysis, overlap statistics, or decontamination checks are reported between the pretraining corpora of Qwen-Image-Edit, FireRed-Image-Edit, and LongCat-Image-Edit and the NYUv2/DIODE/Cityscapes test splits. Given the scale of these models, such checks are load-bearing for interpreting the reported metrics (e.g., 17.69° normals, δ1=0.822 depth) as genuine zero-shot generalization rather than memorization.
  3. [Results] Results on depth (NYUv2) and segmentation (Cityscapes): The affine alignment procedure for depth and the post-processing / label mapping for 19-class vs. 7-category mIoU are described only at high level. Without the precise alignment code or mapping details, it is impossible to confirm that the evaluation protocol is identical to the baselines (Marigold, Vision Banana) and that no test-set information leaks into the reported numbers.
minor comments (2)
  1. [Abstract] The abstract cites closed-source models as “Veo 3, Nano Banana Pro” without references or clarification; if these are illustrative, the text should state so explicitly.
  2. [Tables/Figures] Figure and table captions should explicitly state the evaluation protocol (zero-shot vs. aligned) and the exact metrics used for each row to improve readability.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve clarity, reproducibility, and transparency where feasible.

Point-by-point responses
  1. Referee: [§4] §4 (Experimental Setup) and abstract: The exact prompt templates and input formatting used to elicit depth maps, normal maps, and segmentation outputs from the text-conditioned editors are not provided. Because the models are instruction-following, these templates may embed task definitions (e.g., “output a depth map as an image”), which would undermine the zero-shot claim; full disclosure is required to verify the central result.

    Authors: We agree that full disclosure of the prompts is essential to substantiate the zero-shot claim. In the revised manuscript we will include the complete prompt templates for each task (depth, normals, segmentation) in an expanded Section 4 or dedicated appendix. These are natural-language instructions that specify output format only (e.g., “output a grayscale depth map of the scene”) without task-specific examples or training. The exact prompts are already implemented in the publicly released code, enabling immediate verification. revision: yes

  2. Referee: [§4.2] §4.2 and results tables: No analysis, overlap statistics, or decontamination checks are reported between the pretraining corpora of Qwen-Image-Edit, FireRed-Image-Edit, and LongCat-Image-Edit and the NYUv2/DIODE/Cityscapes test splits. Given the scale of these models, such checks are load-bearing for interpreting the reported metrics (e.g., 17.69° normals, δ1=0.822 depth) as genuine zero-shot generalization rather than memorization.

    Authors: We acknowledge the importance of this check for interpreting zero-shot generalization. Unfortunately the full pretraining corpora of these open-source models are not publicly released by their developers, so exhaustive decontamination is not possible. We will add an explicit limitations paragraph in the revised paper discussing this constraint while noting that cross-model consistency and competitive performance on held-out benchmarks support generalization rather than memorization. All evaluation code remains public for community inspection. revision: partial

  3. Referee: [Results] Results on depth (NYUv2) and segmentation (Cityscapes): The affine alignment procedure for depth and the post-processing / label mapping for 19-class vs. 7-category mIoU are described only at high level. Without the precise alignment code or mapping details, it is impossible to confirm that the evaluation protocol is identical to the baselines (Marigold, Vision Banana) and that no test-set information leaks into the reported numbers.

    Authors: We will revise Section 4.2 to provide the precise affine alignment formula (standard least-squares scale-and-shift fit) and the exact label remapping used for the 7-category mIoU. The complete implementation, including any post-processing steps, is already contained in the released evaluation scripts and matches the protocols of the cited baselines, ensuring no test-set leakage and identical comparison. revision: yes
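
To make the promised revision concrete, here is a minimal sketch of the standard least-squares scale-and-shift fit the authors describe, together with the 19-class to 7-category grouping from Cityscapes' own label definitions. Whether the released scripts match this grouping exactly is an assumption the revised Section 4.2 would confirm.

```python
import numpy as np

def affine_align(pred, gt, mask=None):
    """Fit s, t minimizing ||s * pred + t - gt||^2 over valid pixels,
    then return the aligned prediction (affine-invariant depth protocol)."""
    if mask is None:
        mask = np.isfinite(gt) & (gt > 0)
    p, g = pred[mask].ravel(), gt[mask].ravel()
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    return s * pred + t

# Cityscapes groups its 19 evaluation classes (trainIds 0-18) into 7
# coarse categories: flat, construction, object, nature, sky, human, vehicle.
LUT_19_TO_7 = np.array(
    [0, 0,              # road, sidewalk                     -> flat
     1, 1, 1,           # building, wall, fence              -> construction
     2, 2, 2,           # pole, traffic light, traffic sign  -> object
     3, 3,              # vegetation, terrain                -> nature
     4,                 # sky                                -> sky
     5, 5,              # person, rider                      -> human
     6, 6, 6, 6, 6, 6], # car, truck, bus, train, moto, bike -> vehicle
    dtype=np.uint8)

def to_7_categories(label_map, ignore=255):
    """Remap a 19-class label map to 7 categories, preserving ignore pixels."""
    out = np.full_like(label_map, ignore)
    valid = label_map != ignore
    out[valid] = LUT_19_TO_7[label_map[valid]]
    return out
```

Under this protocol, δ1 is computed on affine_align(pred, gt) rather than on the raw prediction, which is what the "with affine alignment" qualifier on the reported δ1=0.822 refers to.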

standing simulated objections not resolved
  • Comprehensive decontamination checks between model pretraining corpora and the NYUv2/DIODE/Cityscapes test splits, because the training data of Qwen-Image-Edit, FireRed-Image-Edit, and LongCat-Image-Edit is not publicly disclosed.

Circularity Check

0 steps flagged

No circularity: pure empirical benchmarking with no derivation chain

full rationale

The paper is a direct empirical study that reports measured performance numbers (e.g., angular error 17.69°, δ1=0.822, mIoU 25.7) of three image-editing models on standard external benchmarks (NYUv2, DIODE, Cityscapes) with no fine-tuning. No equations, fitted parameters, predictions, ansatzes, or uniqueness theorems appear in the provided text. All claims rest on tabulated results rather than any reduction to inputs by construction. No self-citation load-bearing steps exist; cited models (Marigold, Vision Banana) are external. The zero-shot interpretation may be debatable on prompt or data-overlap grounds, but that is a validity concern, not circularity in a derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim rests entirely on empirical measurements from established benchmarks; no free parameters are introduced, no new entities are postulated, and the only background assumptions are standard ones about benchmark validity and zero-shot usage.

axioms (1)
  • domain assumption: NYUv2, DIODE, and Cityscapes are suitable testbeds for assessing zero-shot visual understanding in image-editing models.
    The paper applies these datasets directly without additional justification for their relevance to the new use case of editing-model evaluation.

pith-pipeline@v0.9.0 · 5642 in / 1326 out tokens · 63127 ms · 2026-05-08T16:39:45.147638+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

26 extracted references · 10 canonical work pages · 3 internal anchors

  1. [2]

    DIODE: A Dense Indoor and Outdoor DEpth Dataset

    Igor Vasiljevic, Nick Kolkin, Shanyi Zhang, Ruotian Luo, Haochen Wang, Falcon Z. Dai, Andrea F. Daniele, Mohammadreza Mostajabi, Steven Basart, Matthew R. Walter, and Gregory Shakhnarovich. DIODE: A Dense Indoor and Outdoor DEpth Dataset. CoRR, abs/1908.00463, 2019. URL http://arxiv.org/abs/1908.00463

  2. [3]

    The Cityscapes dataset for semantic urban scene understanding

    Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016

  3. [4]

    ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes

    Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017

  4. [5]

    Scene parsing through ADE20K dataset

    Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5122–5130, 2017. doi:10.1109/CVPR.2017.544

  5. [6]

    Repurposing diffusion-based image generators for monocular depth estimation

    Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  6. [8]

    Image Generators are Generalist Vision Learners

    Image Generators are Generalist Vision Learners. arXiv preprint, 2026

  7. [10]

    Is Nano Banana Pro a Low-Level Vision All-Rounder? A Comprehensive Evaluation on 14 Tasks and 40 Datasets

    Jialong Zuo, Haoyou Deng, Hanyu Zhou, Jiaxin Zhu, Yicheng Zhang, Yiwei Zhang, Yongxin Yan, Kaixing Huang, Weisen Chen, Yongtai Deng, Rui Jin, Nong Sang, and Changxin Gao. Is Nano Banana Pro a Low-Level Vision All-Rounder? A Comprehensive Evaluation on 14 Tasks and 40 Datasets, 2025. URL https://arxiv.org/abs/2512.15110

  8. [11]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-Image Technical Report. arXiv preprint, 2025

  9. [13]

    FireRed-Image-Edit-1.0 Technical Report

    Changhao Qiao, Chao Hui, Chen Li, Cunzheng Wang, Dejia Song, Jiale Zhang, Jing Li, Qiang Xiang, Runqi Wang, Shuang Sun, Wei Zhu, Xu Tang, Yao Hu, Yibo Chen, Yuhao Huang, Yuxuan Duan, Zhiyi Chen, and Ziyuan Guo. FireRed-Image-Edit-1.0 Technical Report, 2026. URL https://arxiv.org/abs/2602.13344

  10. [14]

    A Power Transform

    A Power Transform. arXiv preprint, 2025

  11. [15]

    Pulling things out of perspective

    Lubor Ladicky, Jianbo Shi, and Marc Pollefeys. Pulling things out of perspective. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014

  12. [16]

    Parameter values for the HDTV standards for production and international programme exchange

    ITU-R. Parameter values for the HDTV standards for production and international programme exchange. Technical report, International Telecommunication Union, June 2015. URL https://www.itu.int/rec/R-REC-BT.709-6-201506-I

  13. [17]

    The cityscapes dataset for semantic urban scene understanding

    Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016

  14. [18]

    Valentin Gabeur, Shangbang Long, Songyou Peng, Paul Voigtlaender, Shuyang Sun, Yanan Bao, Karen Truong, Zhicheng Wang, Wenlei Zhou, Jonathan T. Barron, Kyle Genova, Nithish Kannen, Sherry Ben, Yandong Li, Mandy Guo, Suhas Yogin, Yiming Gu, Huizhong Chen, Oliver Wang, Saining Xie, Howard Zhou, Kaiming He, Thomas Funkhouser, Jean-Baptiste Alayrac, and Radu ...

  15. [19]

    Lotus: Diffusion-based visual foundation model for high-quality dense prediction

    Jing He, Haodong Li, Wei Yin, Yixun Liang, Leheng Li, Kaiqiang Zhou, Hongbo Liu, Bingbing Liu, and Ying-Cong Chen. Lotus: Diffusion-based visual foundation model for high-quality dense prediction. arXiv preprint arXiv:2409.18124, 2024

  16. [20]

    Parameter values for the HDTV standards for production and international programme exchange

    ITU-R. Parameter values for the HDTV standards for production and international programme exchange. Technical report, International Telecommunication Union, June 2015. URL https://www.itu.int/rec/R-REC-BT.709-6-201506-I

  17. [21]

    Repurposing diffusion-based image generators for monocular depth estimation

    Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  18. [22]

    Pulling things out of perspective

    Lubor Ladicky, Jianbo Shi, and Marc Pollefeys. Pulling things out of perspective. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014

  19. [23]

    Longcat-image technical report

    Hanghang Ma, Haoxian Tan, Jiale Huang, Junqiang Wu, Jun-Yan He, Lishuai Gao, Songlin Xiao, Xiaoming Wei, Xiaoqi Ma, Xunliang Cai, Yayong Guan, and Jie Hu. Longcat-image technical report. arXiv preprint arXiv:2512.07584, 2025

  20. [24]

    Firered-image-edit-1.0 technical report

    Changhao Qiao, Chao Hui, Chen Li, Cunzheng Wang, Dejia Song, Jiale Zhang, Jing Li, Qiang Xiang, Runqi Wang, Shuang Sun, Wei Zhu, Xu Tang, Yao Hu, Yibo Chen, Yuhao Huang, Yuxuan Duan, Zhiyi Chen, and Ziyuan Guo. Firered-image-edit-1.0 technical report. 2026. URL https://arxiv.org/abs/2602.13344

  21. [25]

    Indoor segmentation and support inference from rgbd images

    Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In Computer Vision, ECCV 2012 - 12th European Conference on Computer Vision, Proceedings, number PART 5 in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2012

  22. [26]

    DIODE: A Dense Indoor and Outdoor DEpth Dataset

    Igor Vasiljevic, Nick Kolkin, Shanyi Zhang, Ruotian Luo, Haochen Wang, Falcon Z. Dai, Andrea F. Daniele, Mohammadreza Mostajabi, Steven Basart, Matthew R. Walter, and Gregory Shakhnarovich. DIODE: A Dense Indoor and Outdoor DEpth Dataset. CoRR, abs/1908.00463, 2019. URL http://arxiv.org/abs/1908.00463

  23. [27]

    Video models are zero-shot learners and reasoners

    Thadd \"a us Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos. Video models are zero-shot learners and reasoners. arXiv preprint arXiv:2509.20328, 2025

  24. [28]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun...

  25. [29]

    Scene parsing through ade20k dataset

    Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5122–5130, 2017. doi:10.1109/CVPR.2017.544

  26. [30]

    Is Nano Banana Pro a Low-Level Vision All-Rounder? A Comprehensive Evaluation on 14 Tasks and 40 Datasets

    Jialong Zuo, Haoyou Deng, Hanyu Zhou, Jiaxin Zhu, Yicheng Zhang, Yiwei Zhang, Yongxin Yan, Kaixing Huang, Weisen Chen, Yongtai Deng, Rui Jin, Nong Sang, and Changxin Gao. Is nano banana pro a low-level vision all-rounder? a comprehensive evaluation on 14 tasks and 40 datasets, 2025. URL https://arxiv.org/abs/2512.15110