Stable diffusion models reveal a persisting human and AI gap in visual creativity
Pith reviewed 2026-05-17 20:09 UTC · model grok-4.3
The pith
Human visual artists produce more creative images than AI models even with added human guidance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Images created by visual artists received the highest creativity ratings, followed by images from non-artists, then AI images generated with human-inspired prompts, and lowest for AI images produced with minimal guidance. Both human raters and GPT-4o produced this same ordering, though the two groups differed in the specific features they weighed when assigning scores. The study concludes that generative AI encounters distinct obstacles in visual creativity because it depends on perceptual nuance and contextual sensitivity that remain largely human capacities.
What carries the argument
The creativity gradient measured across four production groups—visual artists, non-artists, human-inspired AI, and self-guided AI—through ratings collected from both human evaluators and GPT-4o.
Load-bearing premise
Ratings of creativity supplied by human participants and GPT-4o serve as a valid and unbiased way to compare visual creativity across human-made and AI-generated images.
What would settle it
A replication that finds AI images receiving creativity ratings equal to or higher than those of human artists when the same rating scales and rater pools are used.
Figures
read the original abstract
While recent research suggests Large Language Models match human creative performance in divergent thinking tasks, visual creativity remains underexplored. This study compared image generation in human participants (Visual Artists and Non Artists) and using an image generation AI model (two prompting conditions with varying human input: high for Human Inspired, low for Self Guided). Human raters (N=255) and GPT4o evaluated the creativity of the resulting images. We found a clear creativity gradient, with Visual Artists being the most creative, followed by Non Artists, then Human Inspired generative AI, and finally Self Guided generative AI. Increased human guidance strongly improved GenAI's creative output, bringing its productions close to those of Non Artists. Notably, human and AI raters also showed vastly different creativity judgment patterns. These results suggest that, in contrast to language centered tasks, GenAI models may face unique challenges in visual domains, where creativity depends on perceptual nuance and contextual sensitivity, distinctly human capacities that may not be readily transferable from language models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper compares visual creativity across human participants (visual artists and non-artists) and Stable Diffusion image generations under two conditions (Human-Inspired with high human input and Self-Guided with low input). Ratings from 255 human participants and GPT-4o reveal a creativity gradient (Visual Artists > Non-Artists > Human-Inspired AI > Self-Guided AI), with increased human guidance improving AI outputs toward non-artist levels. Human and AI raters exhibit different judgment patterns, leading the authors to argue that GenAI faces unique challenges in visual domains due to reliance on perceptual nuance and contextual sensitivity that are distinctly human.
Significance. If the rating methodology is strengthened, the work could usefully extend discussions of AI creativity beyond language tasks by providing comparative evidence from the visual domain. The sample size of 255 human raters and the inclusion of both human and GPT-4o evaluators are positive features that allow direct comparison of judgment patterns.
major comments (3)
- [Abstract and Methods] Abstract and Methods: The reported creativity gradient and the claim of a persisting human-AI gap rest on subjective ratings, yet the manuscript provides no inter-rater reliability statistics (e.g., Cronbach’s alpha, ICC, or Fleiss’ kappa), no explicit operational definition or rating criteria for creativity, and no controls for potential confounds such as technical execution quality or stylistic familiarity. Without these, it is unclear whether the observed ordering reflects creativity differences or systematic rater biases against AI-generated images.
- [Results] Results: The abstract states that human and AI raters showed “vastly different creativity judgment patterns,” but the manuscript does not report the specific statistical comparisons, effect sizes, or agreement metrics between the two rater groups. This information is necessary to evaluate whether the differing patterns support the interpretation of uniquely human perceptual capacities.
- [Discussion] Discussion: The conclusion that GenAI models face unique challenges in visual domains because creativity depends on “perceptual nuance and contextual sensitivity” assumes the ratings isolate these capacities. However, absent bias controls or validation against objective creativity markers, alternative explanations (e.g., raters penalizing AI images for lacking human-like execution cues) remain viable and would weaken the contrast to language-centered tasks.
minor comments (2)
- [Abstract] The abstract would be clearer if it briefly stated the exact prompting protocols and image selection criteria used for the AI conditions.
- [Figures] Figure captions or legends should explicitly indicate whether error bars represent standard error or confidence intervals and whether statistical significance markers are shown.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which highlights important areas for strengthening the methodological transparency and interpretive rigor of our work. We address each major comment below and outline the specific revisions we will implement in the next version of the manuscript.
read point-by-point responses
-
Referee: [Abstract and Methods] Abstract and Methods: The reported creativity gradient and the claim of a persisting human-AI gap rest on subjective ratings, yet the manuscript provides no inter-rater reliability statistics (e.g., Cronbach’s alpha, ICC, or Fleiss’ kappa), no explicit operational definition or rating criteria for creativity, and no controls for potential confounds such as technical execution quality or stylistic familiarity. Without these, it is unclear whether the observed ordering reflects creativity differences or systematic rater biases against AI-generated images.
Authors: We agree that these elements are necessary for robust interpretation of subjective ratings. In the revised manuscript we will add Cronbach’s alpha and ICC values computed across the 255 human raters in the Methods section. We will also insert an explicit operational definition of visual creativity, grounded in established criteria of originality, novelty, and contextual appropriateness. To address potential confounds, we will expand the Methods to detail the anonymous, randomized presentation of images (which reduces stylistic familiarity effects) and add a dedicated limitations paragraph discussing the possibility of execution-quality biases. We note that the observed gradient remained stable across multiple rating conditions, which is consistent with prior creativity research, but we will not claim this fully rules out bias without additional controls. revision: yes
-
Referee: [Results] Results: The abstract states that human and AI raters showed “vastly different creativity judgment patterns,” but the manuscript does not report the specific statistical comparisons, effect sizes, or agreement metrics between the two rater groups. This information is necessary to evaluate whether the differing patterns support the interpretation of uniquely human perceptual capacities.
Authors: We will revise the Results section to report the requested statistics. This includes mean differences between human and GPT-4o ratings with accompanying t-tests or mixed-effects models, effect sizes (Cohen’s d), and agreement metrics such as Pearson correlations and, where ratings permit, Cohen’s kappa or intraclass correlations between the two rater groups. We will also describe qualitative differences in judgment patterns (e.g., which image attributes each group weighted more heavily) to support the claim of distinct evaluative criteria. revision: yes
-
Referee: [Discussion] Discussion: The conclusion that GenAI models face unique challenges in visual domains because creativity depends on “perceptual nuance and contextual sensitivity” assumes the ratings isolate these capacities. However, absent bias controls or validation against objective creativity markers, alternative explanations (e.g., raters penalizing AI images for lacking human-like execution cues) remain viable and would weaken the contrast to language-centered tasks.
Authors: We accept that subjective ratings cannot fully isolate perceptual nuance from other cues and that alternative explanations remain plausible. In the revised Discussion we will explicitly enumerate these alternatives, including execution-cue penalties, and temper our claims accordingly while still highlighting the human–AI rater disagreement as evidence of differing judgment bases. We will reference perceptual-processing literature to support the contrast with language tasks. Because the current dataset does not contain separate objective creativity markers or execution-quality ratings, we cannot add post-hoc validation analyses; we will instead frame this as a limitation and propose it as a target for future studies. revision: partial
Circularity Check
No significant circularity in empirical rating study
full rationale
This paper reports an empirical comparison of creativity ratings for images produced by human artists, non-artists, and Stable Diffusion under two prompting regimes, evaluated by both human raters (N=255) and GPT-4o. The abstract and described design contain no equations, fitted parameters, derivation steps, or self-citation chains that reduce the reported gradient or the contrast to language-model performance to prior definitions or inputs by construction. The central claim is grounded in direct experimental outcomes rather than analytic self-reference, satisfying the default expectation for non-circular empirical work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Creativity in images can be validly quantified through aggregated ratings by humans and GPT-4o
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
clear creativity gradient: Visual Artists > Non-Artists ≥ Human-Inspired GenAI > Self-Guided GenAI
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
human and AI raters also showed vastly different creativity judgment patterns
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Diversity vs. recognizability: Human-like generalization in one-shot generative models
M. A. Runco, AI can only produce artificial creativity. J. Creat. 33, 100063 (2023). 20. V. Boutin, L. Singhal, X. Thomas, T. Serre, “Diversity vs. recognizability: Human-like generalization in one-shot generative models” in Advances in Neural Information Processing Systems 35 (NeurIPS, New Orleans, USA, 2022)vol. 35. 21. V. Boutin, T. Fel, L. Singhal, R....
work page 2023
-
[2]
J. Pearson, The human imagination: the cognitive neuroscience of visual mental imagery. Nat. Rev. Neurosci. 20, 624–634 (2019). 30. S. M. Kosslyn, Image and Brain: The Resolution of the Imagery Debate (MIT Press, Cambridge, Mass, 1994). 31. S. M. Kosslyn, W. L. Thompson, G. Ganis, The Case for Mental Imagery (Oxford University Press, 2006). 32. S.-H. Lee,...
-
[3]
M. A. Runco, S. Acar, “Divergent Thinking” in The Cambridge Handbook of Creativity (Cambridge University Press, ed. 2, 2019), pp. 224–254. 51. J. Lehman, E. Meyerson, T. El-Gaaly, K. O. Stanley, T. Ziyaee, Evolution and The Knightian Blindspot of Machine Learning. arXiv [Preprint] (2025). https://doi.org/10.48550/ARXIV.2501.13075. 52. A. Zador, S. Escola,...
-
[4]
M. Csikszentmihalyi, “Society, culture, and person: A systems view of creativity.” in The Nature of Creativity: Contemporary Psychological Perspectives (Cambridge University Press, Cambridge, UK, R. J. Stenberg.), pp. 325–339. 61. M. A. Runco, The discovery and innovation of AI does not qualify as creativity. J. Cogn. Psychol., 1–10 (2024). 62. V. Venkata...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.