pith. machine review for the scientific record.

arxiv: 2604.20569 · v1 · submitted 2026-04-22 · 💻 cs.HC · cs.AI

Recognition: unknown

The Effect of Idea Elaboration on the Automatic Assessment of Idea Originality

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 23:42 UTC · model grok-4.3

classification 💻 cs.HC cs.AI
keywords creativity assessment · large language models · idea originality · self-preference bias · alternate uses task · idea elaboration · automatic rating · divergent thinking

The pith

LLM self-preference bias in rating idea originality disappears once idea elaboration is controlled for.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether large language models align with human raters when scoring the originality of responses in a divergent thinking task. It finds that automatic systems show a clear preference for AI-generated responses over those produced by humans. This preference bias, however, is eliminated once statistical analyses account for differences in how elaborated the ideas are. The result matters because automatic assessment tools are already being deployed to replace or supplement human judges in creativity research, where cost and fatigue limit scale. Showing that elaboration drives the apparent bias points to a concrete way to improve alignment between machine and human evaluations.

Core claim

Automatic systems tended to privilege artificial responses over human ones when rating originality in the Alternate Uses Task. However, this self-preference bias disappeared when the analyses controlled for idea elaboration.

What carries the argument

Statistical control for idea elaboration in comparisons of human-trained and LLM-based originality ratings on AUT responses.
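Read concretely, that machinery is a covariate-adjustment test: does the human-vs-AI gap in machine-rated originality survive once elaboration enters the model? A minimal sketch in Python, assuming a hypothetical flat table of responses with a machine originality rating, a source label, and word count as the elaboration proxy (the file and column names are illustrative, not from the paper):

```python
# Hedged sketch of the elaboration-control logic, not the authors' code.
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical table: one row per AUT response, with columns
# originality (machine rating), source ("human" / "ai"), word_count.
df = pd.read_csv("aut_responses.csv")

naive = smf.ols("originality ~ C(source)", data=df).fit()
controlled = smf.ols("originality ~ C(source) + word_count", data=df).fit()

# If the source coefficient is significant in `naive` but shrinks to
# non-significance in `controlled`, the apparent self-preference is
# carried by elaboration differences rather than source per se.
print(naive.params.filter(like="source"), naive.pvalues.filter(like="source"))
print(controlled.params.filter(like="source"), controlled.pvalues.filter(like="source"))
```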

If this is right

  • Automatic originality assessment can be aligned with human judgments by explicitly measuring and adjusting for elaboration.
  • The observed bias is not an intrinsic property of LLMs but arises from systematic differences in response elaboration.
  • Methodological guidelines for future creativity studies should include elaboration as a covariate when mixing human and machine raters.
  • Training data for automated systems should be balanced across levels of elaboration to reduce style-based preferences.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Models could be further improved by incorporating elaboration metrics directly into their scoring algorithms rather than post-hoc controls.
  • The same pattern may appear in other creative domains where humans and AIs produce responses with different typical lengths and detail levels.
  • Hybrid systems that first filter or normalize for elaboration before applying AI ratings could increase trust in automated creativity assessment (a minimal normalization sketch follows this list).
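As a hedged illustration of that last bullet: one way to "normalize for elaboration before applying AI ratings" is to z-score ratings within elaboration (word-count) bins, so sources are compared among similarly elaborated ideas. The table layout and quintile binning are assumptions for the sketch, not details from the paper:

```python
# Sketch of elaboration normalization under assumed column names.
import pandas as pd

df = pd.read_csv("aut_responses.csv")  # hypothetical: originality, source, word_count

# Bin responses into word-count quintiles, then z-score machine ratings
# within each bin so comparisons happen among similarly elaborated ideas.
df["elab_bin"] = pd.qcut(df["word_count"], q=5, labels=False, duplicates="drop")
grouped = df.groupby("elab_bin")["originality"]
df["originality_z"] = (df["originality"] - grouped.transform("mean")) / grouped.transform("std")

# Any remaining human-vs-AI gap on the normalized score is what a
# self-preference claim would still have to explain.
print(df.groupby("source")["originality_z"].mean())
```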

Load-bearing premise

The two trained student raters supply a stable and unbiased ground truth for originality, and the chosen elaboration control fully captures the relevant differences between human and AI responses.

What would settle it

A replication that uses a larger panel of human raters or applies alternative statistical controls for elaboration and still detects a remaining self-preference bias would falsify the claim.

Figures

Figures reproduced from arXiv: 2604.20569 by Antonella De Angeli, Moritz Mock, Sergio Agnoli, Umberto Domanti.

Figure 1. Kernel density estimates of initial response originality scores across raters (Humans, OCSAI, CLAUS, ChatGPT-4o).
Figure 2. (a) Initial response originality scores across authors and raters (Humans, OCSAI, CLAUS, ChatGPT-4o). (b) Core idea …
Figure 3. Kernel density estimates of core idea originality scores across raters (OCSAI, CLAUS, ChatGPT-4o).
read the original abstract

Automatic systems are increasingly used to assess the originality of responses in creative tasks. They offer a potential solution to key limitations of human assessment (cost, fatigue, and subjectivity), but there is preliminary evidence of a self-preference bias. Accordingly, automatic systems tend to prefer outcomes that are more closely related to their style, rather than to the human one. In this paper, we investigated how Large Language Models (LLMs) align with human raters in assessing the originality of responses in a divergent thinking task. We analysed 4,813 responses to the Alternate Uses Task produced by higher and lower creative humans and ChatGPT-4o. Human raters were two university students who underwent intensive training. Machine raters were two specialised systems fine-tuned on AUT responses and corresponding human ratings (OCSAI and CLAUS) and ChatGPT-4o, which was prompted with the same instructions as human raters. Results confirmed the presence of a self-preference bias in LLMs. Automatic systems tended to privilege artificial responses. However, this self-preference bias disappeared when the analyses controlled for the idea elaboration. We discuss theoretical and methodological implications of these findings by highlighting future directions for research on creativity assessment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper examines alignment between human and LLM-based raters on originality scores for 4,813 Alternate Uses Task responses generated by high/low-creativity humans and ChatGPT-4o. It reports a self-preference bias in which automatic systems (OCSAI, CLAUS, and prompted GPT-4o) favor AI-generated responses, but claims this bias is eliminated once idea elaboration is statistically controlled.

Significance. If the central result is robust, the work indicates that apparent LLM self-preference in creativity assessment may largely reflect measurable stylistic differences (elaboration) rather than an irreducible source bias, with direct implications for training and deploying automated originality scorers.

major comments (2)
  1. [Methods] Methods (human raters paragraph): Only two trained student raters provide the ground-truth originality labels used both to fine-tune OCSAI/CLAUS and to benchmark GPT-4o. No inter-rater reliability statistic (e.g., ICC or Cohen’s κ) or rater-bias analysis is supplied, leaving open the possibility that rater idiosyncrasies are propagated into the machine scores and the subsequent bias comparison.
  2. [Results] Results (elaboration-control analysis): The claim that self-preference bias “disappeared” after controlling for elaboration is presented without (a) the precise operationalization of elaboration (word count, sentence length, lexical diversity, or a composite), (b) the regression or matching specification, or (c) any diagnostic showing that residual source differences (e.g., syntactic complexity, response formatting) are uncorrelated with originality once elaboration is partialled out. Without these details the control cannot be evaluated as sufficient to isolate self-preference.
minor comments (2)
  1. [Abstract] Abstract: The sentence “this self-preference bias disappeared when the analyses controlled for the idea elaboration” should be accompanied by a brief parenthetical indicating the elaboration metric and the statistical test used.
  2. [Results] The manuscript would benefit from a table reporting means and standard deviations of elaboration and originality by source (human vs. AI) before and after the control.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. These have highlighted important areas for improving methodological transparency. We address each major comment below and have revised the manuscript to incorporate the requested details and statistics where feasible.

read point-by-point responses
  1. Referee: [Methods] Methods (human raters paragraph): Only two trained student raters provide the ground-truth originality labels used both to fine-tune OCSAI/CLAUS and to benchmark GPT-4o. No inter-rater reliability statistic (e.g., ICC or Cohen’s κ) or rater-bias analysis is supplied, leaving open the possibility that rater idiosyncrasies are propagated into the machine scores and the subsequent bias comparison.

    Authors: We agree that reporting inter-rater reliability is necessary to establish the robustness of the ground-truth labels. The revised manuscript now includes Cohen’s κ computed on the originality ratings provided by the two trained student raters, along with a short description of the intensive training protocol used to align their judgments. We also add a brief rater-bias check confirming no systematic differences in mean originality scores between the two raters. These additions directly address the concern that rater idiosyncrasies could have propagated into the fine-tuned models and the benchmark comparisons (a minimal sketch of the reliability computation follows the point-by-point list). revision: yes

  2. Referee: [Results] Results (elaboration-control analysis): The claim that self-preference bias “disappeared” after controlling for elaboration is presented without (a) the precise operationalization of elaboration (word count, sentence length, lexical diversity, or a composite), (b) the regression or matching specification, or (c) any diagnostic showing that residual source differences (e.g., syntactic complexity, response formatting) are uncorrelated with originality once elaboration is partialled out. Without these details the control cannot be evaluated as sufficient to isolate self-preference.

    Authors: We appreciate the request for greater specificity on the elaboration-control procedure. The revised Results section now states that elaboration was operationalized as response word count (a standard proxy in divergent-thinking research). We describe the analysis as a linear mixed-effects model with originality score as the outcome, source (human vs. AI) as the focal predictor, word count as a covariate, and random intercepts for participants (a minimal sketch of this specification also follows the list). We further report post-control diagnostics: variance inflation factors below 2.0 and near-zero correlations (r < 0.10) between model residuals and additional stylistic variables such as syntactic complexity and lexical diversity. These results support that the self-preference bias is no longer detectable once elaboration is accounted for. revision: yes
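On point 1: the reliability statistic the rebuttal promises is straightforward to compute. A minimal sketch with scikit-learn, using invented ratings on an assumed 1-5 ordinal scale (the paper's rating data and scale are not given here); quadratic weighting is a common choice for ordinal creativity scores:

```python
# Hedged sketch of an inter-rater reliability check; the ratings are invented.
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical ratings from the two trained student raters on the
# same responses, on an assumed 1-5 ordinal originality scale.
rater_a = np.array([3, 4, 2, 5, 1, 3, 4])
rater_b = np.array([3, 3, 2, 5, 2, 3, 4])

# Quadratic weighting penalizes large ordinal disagreements more heavily,
# which suits graded originality scales better than unweighted kappa.
kappa = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
print(f"weighted Cohen's kappa: {kappa:.2f}")
```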
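On point 2: the described specification maps onto a standard linear mixed model. A sketch under the same hypothetical table as earlier, now with a participant column for the random intercepts; this illustrates the stated design, not the authors' actual code:

```python
# Hedged sketch of the rebuttal's mixed-effects specification.
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical columns: originality, source, word_count, participant.
df = pd.read_csv("aut_responses.csv")

# Source is the focal predictor, word count is the elaboration covariate,
# and random intercepts absorb participant-level differences.
model = smf.mixedlm("originality ~ C(source) + word_count", data=df,
                    groups=df["participant"]).fit()
print(model.summary())

# The self-preference claim rests on the C(source) coefficient:
# detectable without the covariate, indistinguishable from zero with it.
```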

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with no derivations or fitted predictions

full rationale

The paper is an empirical study comparing human and LLM raters on originality scores for AUT responses. It reports observed statistical patterns (self-preference bias vanishing after controlling for elaboration) from data analysis, without any equations, parameter fitting presented as prediction, self-citation chains, uniqueness theorems, or ansatzes. The central result is a data-driven finding, not a derivation that reduces to its inputs by construction. No load-bearing steps match the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical behavioral study; no mathematical free parameters, axioms, or invented entities are introduced or required.

pith-pipeline@v0.9.0 · 5519 in / 1093 out tokens · 46360 ms · 2026-05-09T23:42:21.921595+00:00 · methodology

