Stable diffusion models reveal a persisting human and AI gap in visual creativity

Antoni Rodriguez-Fornells; Claudia Alvarez-Martin; Dan Dediu; Matthew Pelowski; M. Paz; Olivier Penacchio; Paula Angermair-Barkai; Silvia Rondini; Xim Cerda-Company

arxiv: 2511.16814 · v2 · submitted 2025-11-20 · 💻 cs.AI · cs.HC

Stable diffusion models reveal a persisting human and AI gap in visual creativity

Silvia Rondini , Claudia Alvarez-Martin , Paula Angermair-Barkai , Olivier Penacchio , M. Paz , Matthew Pelowski , Dan Dediu , Antoni Rodriguez-Fornells

show 1 more author

Xim Cerda-Company

This is my paper

Pith reviewed 2026-05-17 20:09 UTC · model grok-4.3

classification 💻 cs.AI cs.HC

keywords visual creativitygenerative AIimage generationhuman-AI comparisoncreativity ratingsstable diffusionperceptual nuance

0 comments

The pith

Human visual artists produce more creative images than AI models even with added human guidance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares image generation by visual artists, non-artists, and a generative AI model under two prompting conditions that vary the amount of human input. Both human participants and GPT-4o then rated the creativity of all outputs. Results show a steady decline in rated creativity from artists to non-artists to guided AI to self-guided AI, with added human guidance lifting AI performance toward non-artist levels. Human and AI raters also applied different standards when judging creativity. This pattern suggests visual creativity draws on perceptual and contextual abilities that current AI systems do not fully replicate, in contrast to their closer performance on language tasks.

Core claim

Images created by visual artists received the highest creativity ratings, followed by images from non-artists, then AI images generated with human-inspired prompts, and lowest for AI images produced with minimal guidance. Both human raters and GPT-4o produced this same ordering, though the two groups differed in the specific features they weighed when assigning scores. The study concludes that generative AI encounters distinct obstacles in visual creativity because it depends on perceptual nuance and contextual sensitivity that remain largely human capacities.

What carries the argument

The creativity gradient measured across four production groups—visual artists, non-artists, human-inspired AI, and self-guided AI—through ratings collected from both human evaluators and GPT-4o.

Load-bearing premise

Ratings of creativity supplied by human participants and GPT-4o serve as a valid and unbiased way to compare visual creativity across human-made and AI-generated images.

What would settle it

A replication that finds AI images receiving creativity ratings equal to or higher than those of human artists when the same rating scales and rater pools are used.

Figures

Figures reproduced from arXiv: 2511.16814 by Antoni Rodriguez-Fornells, Claudia Alvarez-Martin, Dan Dediu, Matthew Pelowski, M. Paz, Olivier Penacchio, Paula Angermair-Barkai, Silvia Rondini, Xim Cerda-Company.

**Figure 1.** Figure 1: Scheme of the study’s four phases. Phase I: Creative image generation by Visual-Artists and NonArtists. The resulting human images were used to fine-tune the Diffusion model (SDXL) through Low-Rank Adaptation, while the human-generated ideas were used in the prompt of the Human-Inspired GenAI group. Phase II: Creative image generation by GenAI (Human-Inspired and Self-Guided). Phase III: Creative image da… view at source ↗

**Figure 2.** Figure 2: Examples of drawings from each category (in rows): Visual Artists, Non-Artists, HI-GenAI, SG-GenAI. The values reported in the right-hand top corner of each drawing corresponds to the drawing’s Creativity score by humans and GPT-4o raters, respectively. Human ratings: analysis results Overall Creativity Analysis When studying human ratings, an overall Creativity score was generated from the five dimensions… view at source ↗

read the original abstract

While recent research suggests Large Language Models match human creative performance in divergent thinking tasks, visual creativity remains underexplored. This study compared image generation in human participants (Visual Artists and Non Artists) and using an image generation AI model (two prompting conditions with varying human input: high for Human Inspired, low for Self Guided). Human raters (N=255) and GPT4o evaluated the creativity of the resulting images. We found a clear creativity gradient, with Visual Artists being the most creative, followed by Non Artists, then Human Inspired generative AI, and finally Self Guided generative AI. Increased human guidance strongly improved GenAI's creative output, bringing its productions close to those of Non Artists. Notably, human and AI raters also showed vastly different creativity judgment patterns. These results suggest that, in contrast to language centered tasks, GenAI models may face unique challenges in visual domains, where creativity depends on perceptual nuance and contextual sensitivity, distinctly human capacities that may not be readily transferable from language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper reports a creativity gradient with visual artists ahead of non-artists ahead of guided AI ahead of unguided AI, plus a split between human and GPT-4o raters, but the subjective scores lack reported reliability checks.

read the letter

The main thing to know is that this study finds visual artists rated highest on creativity, followed by non-artists, then AI images made with human-inspired prompts, and lowest for self-guided AI prompts. Human raters and GPT-4o also produced quite different judgment patterns. The work extends earlier text-based comparisons into image generation by varying how much human direction goes into the prompts and by using both human and AI evaluators on the outputs. The gradient comes through clearly and the sample of 255 raters is large enough to give the ratings some stability. The design choice to test high versus low human input is straightforward and shows that extra guidance lifts the AI results close to non-artist level. That part of the setup is useful for anyone thinking about how to steer generative models. The soft spot is the measurement of creativity itself. The abstract gives no inter-rater agreement numbers and no explicit criteria or controls for things like image polish or stylistic familiarity. If raters are mainly reacting to the machine-like look of the AI images rather than to the underlying idea, the gap and the claim about uniquely human perceptual nuance would not hold as strongly. The contrast to language tasks is reasonable to raise but rests on the same unverified ratings. This paper is for researchers working on AI evaluation or the psychology of visual creativity. A reader who wants concrete data on human versus model performance across domains would get value from the conditions tested, though they would need the full methods to judge the stats. It deserves a serious referee because the question is timely and the prompting manipulation is worth checking with tighter validation of the scores. I would send it out for review rather than desk reject.

Referee Report

3 major / 2 minor

Summary. The paper compares visual creativity across human participants (visual artists and non-artists) and Stable Diffusion image generations under two conditions (Human-Inspired with high human input and Self-Guided with low input). Ratings from 255 human participants and GPT-4o reveal a creativity gradient (Visual Artists > Non-Artists > Human-Inspired AI > Self-Guided AI), with increased human guidance improving AI outputs toward non-artist levels. Human and AI raters exhibit different judgment patterns, leading the authors to argue that GenAI faces unique challenges in visual domains due to reliance on perceptual nuance and contextual sensitivity that are distinctly human.

Significance. If the rating methodology is strengthened, the work could usefully extend discussions of AI creativity beyond language tasks by providing comparative evidence from the visual domain. The sample size of 255 human raters and the inclusion of both human and GPT-4o evaluators are positive features that allow direct comparison of judgment patterns.

major comments (3)

[Abstract and Methods] Abstract and Methods: The reported creativity gradient and the claim of a persisting human-AI gap rest on subjective ratings, yet the manuscript provides no inter-rater reliability statistics (e.g., Cronbach’s alpha, ICC, or Fleiss’ kappa), no explicit operational definition or rating criteria for creativity, and no controls for potential confounds such as technical execution quality or stylistic familiarity. Without these, it is unclear whether the observed ordering reflects creativity differences or systematic rater biases against AI-generated images.
[Results] Results: The abstract states that human and AI raters showed “vastly different creativity judgment patterns,” but the manuscript does not report the specific statistical comparisons, effect sizes, or agreement metrics between the two rater groups. This information is necessary to evaluate whether the differing patterns support the interpretation of uniquely human perceptual capacities.
[Discussion] Discussion: The conclusion that GenAI models face unique challenges in visual domains because creativity depends on “perceptual nuance and contextual sensitivity” assumes the ratings isolate these capacities. However, absent bias controls or validation against objective creativity markers, alternative explanations (e.g., raters penalizing AI images for lacking human-like execution cues) remain viable and would weaken the contrast to language-centered tasks.

minor comments (2)

[Abstract] The abstract would be clearer if it briefly stated the exact prompting protocols and image selection criteria used for the AI conditions.
[Figures] Figure captions or legends should explicitly indicate whether error bars represent standard error or confidence intervals and whether statistical significance markers are shown.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which highlights important areas for strengthening the methodological transparency and interpretive rigor of our work. We address each major comment below and outline the specific revisions we will implement in the next version of the manuscript.

read point-by-point responses

Referee: [Abstract and Methods] Abstract and Methods: The reported creativity gradient and the claim of a persisting human-AI gap rest on subjective ratings, yet the manuscript provides no inter-rater reliability statistics (e.g., Cronbach’s alpha, ICC, or Fleiss’ kappa), no explicit operational definition or rating criteria for creativity, and no controls for potential confounds such as technical execution quality or stylistic familiarity. Without these, it is unclear whether the observed ordering reflects creativity differences or systematic rater biases against AI-generated images.

Authors: We agree that these elements are necessary for robust interpretation of subjective ratings. In the revised manuscript we will add Cronbach’s alpha and ICC values computed across the 255 human raters in the Methods section. We will also insert an explicit operational definition of visual creativity, grounded in established criteria of originality, novelty, and contextual appropriateness. To address potential confounds, we will expand the Methods to detail the anonymous, randomized presentation of images (which reduces stylistic familiarity effects) and add a dedicated limitations paragraph discussing the possibility of execution-quality biases. We note that the observed gradient remained stable across multiple rating conditions, which is consistent with prior creativity research, but we will not claim this fully rules out bias without additional controls. revision: yes
Referee: [Results] Results: The abstract states that human and AI raters showed “vastly different creativity judgment patterns,” but the manuscript does not report the specific statistical comparisons, effect sizes, or agreement metrics between the two rater groups. This information is necessary to evaluate whether the differing patterns support the interpretation of uniquely human perceptual capacities.

Authors: We will revise the Results section to report the requested statistics. This includes mean differences between human and GPT-4o ratings with accompanying t-tests or mixed-effects models, effect sizes (Cohen’s d), and agreement metrics such as Pearson correlations and, where ratings permit, Cohen’s kappa or intraclass correlations between the two rater groups. We will also describe qualitative differences in judgment patterns (e.g., which image attributes each group weighted more heavily) to support the claim of distinct evaluative criteria. revision: yes
Referee: [Discussion] Discussion: The conclusion that GenAI models face unique challenges in visual domains because creativity depends on “perceptual nuance and contextual sensitivity” assumes the ratings isolate these capacities. However, absent bias controls or validation against objective creativity markers, alternative explanations (e.g., raters penalizing AI images for lacking human-like execution cues) remain viable and would weaken the contrast to language-centered tasks.

Authors: We accept that subjective ratings cannot fully isolate perceptual nuance from other cues and that alternative explanations remain plausible. In the revised Discussion we will explicitly enumerate these alternatives, including execution-cue penalties, and temper our claims accordingly while still highlighting the human–AI rater disagreement as evidence of differing judgment bases. We will reference perceptual-processing literature to support the contrast with language tasks. Because the current dataset does not contain separate objective creativity markers or execution-quality ratings, we cannot add post-hoc validation analyses; we will instead frame this as a limitation and propose it as a target for future studies. revision: partial

Circularity Check

0 steps flagged

No significant circularity in empirical rating study

full rationale

This paper reports an empirical comparison of creativity ratings for images produced by human artists, non-artists, and Stable Diffusion under two prompting regimes, evaluated by both human raters (N=255) and GPT-4o. The abstract and described design contain no equations, fitted parameters, derivation steps, or self-citation chains that reduce the reported gradient or the contrast to language-model performance to prior definitions or inputs by construction. The central claim is grounded in direct experimental outcomes rather than analytic self-reference, satisfying the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the validity of subjective creativity ratings as a proxy and on the assumption that the two prompting conditions adequately sample AI generative capacity.

axioms (1)

domain assumption Creativity in images can be validly quantified through aggregated ratings by humans and GPT-4o
The study treats these ratings as the primary outcome measure without independent validation against other creativity metrics.

pith-pipeline@v0.9.0 · 5507 in / 1117 out tokens · 58546 ms · 2026-05-17T20:09:56.104775+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

clear creativity gradient: Visual Artists > Non-Artists ≥ Human-Inspired GenAI > Self-Guided GenAI
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

human and AI raters also showed vastly different creativity judgment patterns

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages

[1]

Diversity vs. recognizability: Human-like generalization in one-shot generative models

M. A. Runco, AI can only produce artificial creativity. J. Creat. 33, 100063 (2023). 20. V. Boutin, L. Singhal, X. Thomas, T. Serre, “Diversity vs. recognizability: Human-like generalization in one-shot generative models” in Advances in Neural Information Processing Systems 35 (NeurIPS, New Orleans, USA, 2022)vol. 35. 21. V. Boutin, T. Fel, L. Singhal, R....

work page 2023
[2]

ordinary

J. Pearson, The human imagination: the cognitive neuroscience of visual mental imagery. Nat. Rev. Neurosci. 20, 624–634 (2019). 30. S. M. Kosslyn, Image and Brain: The Resolution of the Imagery Debate (MIT Press, Cambridge, Mass, 1994). 31. S. M. Kosslyn, W. L. Thompson, G. Ganis, The Case for Mental Imagery (Oxford University Press, 2006). 32. S.-H. Lee,...

work page doi:10.48550/arxiv.2401.08276 2019
[3]

Divergent Thinking

M. A. Runco, S. Acar, “Divergent Thinking” in The Cambridge Handbook of Creativity (Cambridge University Press, ed. 2, 2019), pp. 224–254. 51. J. Lehman, E. Meyerson, T. El-Gaaly, K. O. Stanley, T. Ziyaee, Evolution and The Knightian Blindspot of Machine Learning. arXiv [Preprint] (2025). https://doi.org/10.48550/ARXIV.2501.13075. 52. A. Zador, S. Escola,...

work page doi:10.48550/arxiv.2501.13075 2019
[4]

hysteresis type phenomenon

M. Csikszentmihalyi, “Society, culture, and person: A systems view of creativity.” in The Nature of Creativity: Contemporary Psychological Perspectives (Cambridge University Press, Cambridge, UK, R. J. Stenberg.), pp. 325–339. 61. M. A. Runco, The discovery and innovation of AI does not qualify as creativity. J. Cogn. Psychol., 1–10 (2024). 62. V. Venkata...

work page doi:10.13039/501100011033 2024

[1] [1]

Diversity vs. recognizability: Human-like generalization in one-shot generative models

M. A. Runco, AI can only produce artificial creativity. J. Creat. 33, 100063 (2023). 20. V. Boutin, L. Singhal, X. Thomas, T. Serre, “Diversity vs. recognizability: Human-like generalization in one-shot generative models” in Advances in Neural Information Processing Systems 35 (NeurIPS, New Orleans, USA, 2022)vol. 35. 21. V. Boutin, T. Fel, L. Singhal, R....

work page 2023

[2] [2]

ordinary

J. Pearson, The human imagination: the cognitive neuroscience of visual mental imagery. Nat. Rev. Neurosci. 20, 624–634 (2019). 30. S. M. Kosslyn, Image and Brain: The Resolution of the Imagery Debate (MIT Press, Cambridge, Mass, 1994). 31. S. M. Kosslyn, W. L. Thompson, G. Ganis, The Case for Mental Imagery (Oxford University Press, 2006). 32. S.-H. Lee,...

work page doi:10.48550/arxiv.2401.08276 2019

[3] [3]

Divergent Thinking

M. A. Runco, S. Acar, “Divergent Thinking” in The Cambridge Handbook of Creativity (Cambridge University Press, ed. 2, 2019), pp. 224–254. 51. J. Lehman, E. Meyerson, T. El-Gaaly, K. O. Stanley, T. Ziyaee, Evolution and The Knightian Blindspot of Machine Learning. arXiv [Preprint] (2025). https://doi.org/10.48550/ARXIV.2501.13075. 52. A. Zador, S. Escola,...

work page doi:10.48550/arxiv.2501.13075 2019

[4] [4]

hysteresis type phenomenon

M. Csikszentmihalyi, “Society, culture, and person: A systems view of creativity.” in The Nature of Creativity: Contemporary Psychological Perspectives (Cambridge University Press, Cambridge, UK, R. J. Stenberg.), pp. 325–339. 61. M. A. Runco, The discovery and innovation of AI does not qualify as creativity. J. Cogn. Psychol., 1–10 (2024). 62. V. Venkata...

work page doi:10.13039/501100011033 2024