Recognition: 2 theorem links · Lean Theorem
Banana100: Breaking NR-IQA Metrics by 100 Iterative Image Replications with Nano Banana Pro
Pith reviewed 2026-05-13 20:14 UTC · model grok-4.3
The pith
Iterative image editing over 100 steps creates severely degraded images that standard quality metrics fail to rate lower than clean originals.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using models such as Nano Banana Pro for 100 iterative replication and editing steps produces images with visible noise and instruction-following failures. When the results are evaluated with 21 no-reference image quality assessment (NR-IQA) metrics, none of them consistently assigns lower quality scores to the degraded images than to their clean starting points, despite the clear visual deterioration.
What carries the argument
Banana100 dataset of images degraded through 100 iterative editing steps, used to test and reveal the limitations of NR-IQA metrics in detecting accumulated artifacts.
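To make the construction concrete, here is a minimal sketch of such a degradation loop. The edit_once function is a hypothetical stand-in for a Nano Banana Pro editing call, implemented here as a noise-injecting stub so the example runs on its own; it is not the paper's pipeline.

```python
import numpy as np

def edit_once(image: np.ndarray, instruction: str) -> np.ndarray:
    """Stand-in for one image-editing call. The stub only injects mild noise,
    so artifact accumulation over many steps is visible without a real model."""
    noise = np.random.normal(0.0, 2.0, size=image.shape)
    return np.clip(image + noise, 0, 255)

def replicate(image: np.ndarray, steps: int = 100,
              instruction: str = "replicate this image") -> list[np.ndarray]:
    """Apply the same replication instruction repeatedly, keeping every step."""
    trajectory = [image]
    for _ in range(steps):
        trajectory.append(edit_once(trajectory[-1], instruction))
    return trajectory  # trajectory[0] is clean, trajectory[-1] is the 100-step image

clean = np.random.randint(0, 256, size=(256, 256, 3)).astype(np.float64)
images = replicate(clean)  # 101 images: the original plus 100 edited versions
```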
If this is right
- Multi-turn editing in agentic systems risks generating undetectable low-quality images.
- Quality filters based on current NR-IQA metrics may fail to prevent degraded data from entering training pipelines.
- The stability of future multi-modal models could be compromised by reliance on these flawed evaluators.
- Development of new robust quality assessment methods is necessary for safe deployment of iterative editing agents.
Where Pith is reading between the lines
- Similar issues may arise in other iterative generative processes beyond images, such as in video or 3D content creation.
- Agent designs could benefit from built-in quality checkpoints at fewer steps rather than relying on end-of-process metrics.
- Human preference studies might be needed to calibrate new metrics for detecting iterative degradation specifically.
Load-bearing premise
That the degradation from 100 iterative edits is a form of quality loss that NR-IQA metrics should be expected to detect and penalize.
What would settle it
Finding even one NR-IQA metric among the tested ones that assigns consistently lower scores to the 100-step degraded images than to the originals across the Banana100 dataset would contradict the claim.
Original abstract
The multi-step, iterative image editing capabilities of multi-modal agentic systems have transformed digital content creation. Although latest image editing models faithfully follow instructions and generate high-quality images in single-turn edits, we identify a critical weakness in multi-turn editing, which is the iterative degradation of image quality. As images are repeatedly edited, minor artifacts accumulate, rapidly leading to a severe accumulation of visible noise and a failure to follow simple editing instructions. To systematically study these failures, we introduce Banana100, a comprehensive dataset of 28,000 degraded images generated through 100 iterative editing steps, including diverse textures and image content. Alarmingly, image quality evaluators fail to detect the degradation. Among 21 popular no-reference image quality assessment (NR-IQA) metrics, none of them consistently assign lower scores to heavily degraded images than to clean ones. The dual failures of generators and evaluators may threaten the stability of future model training and the safety of deployed agentic systems, if the low-quality synthetic data generated by multi-turn edits escape quality filters. We release the full code and data to facilitate the development of more robust models, helping to mitigate the fragility of multi-modal agentic systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Banana100, a dataset of 28,000 images produced by applying 100 iterative editing steps to diverse source images using multi-modal agentic systems. It reports that none of 21 popular NR-IQA metrics consistently assign lower scores to the resulting heavily degraded images than to the clean originals, and argues that this reveals a dual weakness in generators (accumulating artifacts) and evaluators (failing to detect them), with potential risks for synthetic data in model training. The authors release code and data.
Significance. If the empirical results survive correction for metric polarity and are supported by appropriate statistical controls, the work would usefully document a practical limitation of current NR-IQA metrics when confronted with accumulated, instruction-driven degradations rather than conventional distortions. The public release of code and data is a clear strength that supports reproducibility and follow-on research on robust quality assessment for agentic image pipelines.
major comments (2)
- [Abstract] Abstract: the central claim that 'none of them consistently assign lower scores to heavily degraded images than to clean ones' presupposes a uniform polarity in which lower numerical output always indicates worse quality. Standard implementations differ (BRISQUE, NIQE, and PIQE output lower values for higher quality; many learned regressors output higher values for higher quality). Without explicit inversion or rescaling to a common convention before comparison, the reported failure cannot be interpreted as genuine insensitivity to the Banana100 degradations (a normalization sketch follows this list).
- [Methods] The manuscript provides no description of how 'consistently' was operationalized across the 28,000 images (e.g., fraction of images for which the degraded score is worse, any statistical test, or handling of ties). This detail is load-bearing for the claim that the metrics 'fail to detect the degradation.'
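As flagged in the first major comment, clean-versus-degraded comparisons are only meaningful after mapping every metric to a common polarity. A minimal sketch, assuming BRISQUE, NIQE, and PIQE are the lower-is-better metrics in play (the actual list should come from each implementation's documentation):

```python
# Metrics whose raw output decreases as quality improves (assumed list;
# verify against each implementation's documentation).
LOWER_IS_BETTER = {"brisque", "niqe", "piqe"}

def to_higher_is_better(metric_name: str, score: float) -> float:
    """Flip lower-is-better scores so that, after this call, a larger value
    always means better quality for every metric."""
    return -score if metric_name.lower() in LOWER_IS_BETTER else score

# A BRISQUE score of 45 (worse image) must end up below a score of 20 (better image).
assert to_higher_is_better("BRISQUE", 45.0) < to_higher_is_better("BRISQUE", 20.0)
```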
minor comments (1)
- A supplementary table listing the 21 metrics together with their native polarity (higher-better or lower-better) and the exact implementation version used would improve clarity and allow readers to verify the comparison protocol.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and detailed review. The comments correctly identify areas where the manuscript lacks sufficient clarity on metric handling and experimental definitions. We will make major revisions to address both points explicitly while preserving the core empirical findings.
Point-by-point responses
- Referee: [Abstract] Abstract: the central claim that 'none of them consistently assign lower scores to heavily degraded images than to clean ones' presupposes a uniform polarity in which lower numerical output always indicates worse quality. Standard implementations differ (BRISQUE, NIQE, and PIQE output lower values for higher quality; many learned regressors output higher values for higher quality). Without explicit inversion or rescaling to a common convention before comparison, the reported failure cannot be interpreted as genuine insensitivity to the Banana100 degradations.
Authors: The referee is correct that the manuscript does not describe polarity normalization. In our analysis we inverted scores for BRISQUE, NIQE, and PIQE (and any other metrics with opposite polarity) so that, after normalization, lower values always indicate worse quality. This step was performed but omitted from the text. We will add a clear Methods subsection listing all inverted metrics and the normalization procedure.
revision: yes
- Referee: [Methods] The manuscript provides no description of how 'consistently' was operationalized across the 28,000 images (e.g., fraction of images for which the degraded score is worse, any statistical test, or handling of ties). This detail is load-bearing for the claim that the metrics 'fail to detect the degradation.'
Authors: We agree that the operational definition of 'consistently' must be stated explicitly. We defined it as the case in which the normalized score for the degraded image was not worse than the original in the majority of pairs (i.e., failure rate > 50%). Ties were counted as failures to detect degradation. We will revise the Methods section to state this threshold, describe tie handling, and report per-metric fractions together with a paired sign test for statistical significance.
revision: yes
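A minimal sketch of how this definition could be implemented, assuming the scores have already been polarity-normalized so that higher means better; the function and variable names are illustrative, not the paper's code:

```python
import numpy as np
from scipy.stats import binomtest

def detection_stats(clean: np.ndarray, degraded: np.ndarray):
    """Per-metric check over paired (clean, 100-step) scores, both higher-is-better.
    A pair counts as 'detected' only when the degraded score is strictly worse;
    ties count as detection failures, matching the rebuttal's convention."""
    detected = degraded < clean
    n = len(clean)
    failure_rate = 1.0 - detected.sum() / n
    # Paired sign test: are detections more frequent than chance (p = 0.5)?
    p_value = binomtest(int(detected.sum()), n, p=0.5, alternative="greater").pvalue
    consistent = failure_rate <= 0.5  # fails the 'consistently' bar only if failures form a majority
    return failure_rate, p_value, consistent
```

Running this once per metric over the 28,000 pairs and reporting the three numbers side by side would make the 'fail to detect' claim directly auditable.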
Circularity Check
No circularity: direct empirical benchmarking without derivation or fitted inputs
Full rationale
The paper is an empirical study that generates a dataset of iteratively degraded images and directly compares scores from 21 existing NR-IQA metrics on clean versus degraded versions. No mathematical derivation chain, parameter fitting, self-definitional equations, or load-bearing self-citations are present. The central claim rests on observed score comparisons rather than any reduction to prior inputs by construction. This is a standard experimental benchmarking setup, evaluated against pre-existing external metrics and data rather than quantities defined within the paper itself.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Iterative image editing accumulates visible noise and artifacts that standard NR-IQA metrics should detect as reduced quality.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel — unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "Among 21 popular no-reference image quality assessment (NR-IQA) metrics, none of them consistently assign lower scores to heavily degraded images than to clean ones."
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative — unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "We normalized NR-IQA scores (BRISQUE as an example) and calculated the difference across steps to quantify the score trend."
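The quoted passage describes a simple per-trajectory computation. A minimal sketch under assumed inputs, where scores_by_step holds one polarity-normalized (higher-is-better) BRISQUE-style score per editing step for a single image; the normalization choice is illustrative, not necessarily the paper's:

```python
import numpy as np

def score_trend(scores_by_step: np.ndarray) -> np.ndarray:
    """Z-normalize the score trajectory, then take step-to-step differences.
    Negative entries mean the metric reads the image as worse after that step."""
    z = (scores_by_step - scores_by_step.mean()) / (scores_by_step.std() + 1e-8)
    return np.diff(z)
```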
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.