The Effect of Idea Elaboration on the Automatic Assessment of Idea Originality
Pith reviewed 2026-05-09 23:42 UTC · model grok-4.3
The pith
LLM self-preference bias in rating idea originality disappears once idea elaboration is controlled for.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Automatic systems tended to privilege artificial responses over human ones when rating originality in the Alternate Uses Task. However, this self-preference bias disappeared once the analyses controlled for idea elaboration.
What carries the argument
Statistical control for idea elaboration in comparisons of human-trained and LLM-based originality ratings on AUT responses.
If this is right
- Automatic originality assessment can be aligned with human judgments by explicitly measuring and adjusting for elaboration.
- The observed bias is not an intrinsic property of LLMs but arises from systematic differences in response elaboration.
- Methodological guidelines for future creativity studies should include elaboration as a covariate when mixing human and machine raters.
- Training data for automated systems should be balanced across levels of elaboration to reduce style-based preferences.
Where Pith is reading between the lines
- Models could be further improved by incorporating elaboration metrics directly into their scoring algorithms rather than post-hoc controls.
- The same pattern may appear in other creative domains where humans and AIs produce responses with different typical lengths and detail levels.
- Hybrid systems that first filter or normalize for elaboration before applying AI ratings could increase trust in automated creativity assessment.
Load-bearing premise
The two trained student raters supply a stable and unbiased ground truth for originality, and the chosen elaboration control fully captures the relevant differences between human and AI responses.
What would settle it
A replication that uses a larger panel of human raters or applies alternative statistical controls for elaboration and still detects a remaining self-preference bias would falsify the claim.
Figures
Original abstract
Automatic systems are increasingly used to assess the originality of responses in creative tasks. They offer a potential solution to key limitations of human assessment (cost, fatigue, and subjectivity), but there is preliminary evidence of a self-preference bias. Accordingly, automatic systems tend to prefer outcomes that are more closely related to their style, rather than to the human one. In this paper, we investigated how Large Language Models (LLMs) align with human raters in assessing the originality of responses in a divergent thinking task. We analysed 4,813 responses to the Alternate Uses Task produced by higher and lower creative humans and ChatGPT-4o. Human raters were two university students who underwent intensive training. Machine raters were two specialised systems fine-tuned on AUT responses and corresponding human ratings (OCSAI and CLAUS) and ChatGPT-4o, which was prompted with the same instructions as human raters. Results confirmed the presence of a self-preference bias in LLMs. Automatic systems tended to privilege artificial responses. However, this self-preference bias disappeared when the analyses controlled for the idea elaboration. We discuss theoretical and methodological implications of these findings by highlighting future directions for research on creativity assessment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper examines alignment between human and LLM-based raters on originality scores for 4,813 Alternate Uses Task responses generated by high/low-creativity humans and ChatGPT-4o. It reports a self-preference bias in which automatic systems (OCSAI, CLAUS, and prompted GPT-4o) favor AI-generated responses, but claims this bias is eliminated once idea elaboration is statistically controlled.
Significance. If the central result is robust, the work indicates that apparent LLM self-preference in creativity assessment may largely reflect measurable stylistic differences (elaboration) rather than an irreducible source bias, with direct implications for training and deploying automated originality scorers.
major comments (2)
- [Methods] Methods (human raters paragraph): Only two trained student raters provide the ground-truth originality labels used both to fine-tune OCSAI/CLAUS and to benchmark GPT-4o. No inter-rater reliability statistic (e.g., ICC or Cohen’s κ) or rater-bias analysis is supplied, leaving open the possibility that rater idiosyncrasies are propagated into the machine scores and the subsequent bias comparison.
- [Results] Results (elaboration-control analysis): The claim that self-preference bias “disappeared” after controlling for elaboration is presented without (a) the precise operationalization of elaboration (word count, sentence length, lexical diversity, or a composite), (b) the regression or matching specification, or (c) any diagnostic showing that residual source differences (e.g., syntactic complexity, response formatting) are uncorrelated with originality once elaboration is partialled out. Without these details the control cannot be evaluated as sufficient to isolate self-preference.
minor comments (2)
- [Abstract] Abstract: The sentence “this self-preference bias disappeared when the analyses controlled for the idea elaboration” should be accompanied by a brief parenthetical indicating the elaboration metric and the statistical test used.
- [Results] The manuscript would benefit from a table reporting means and standard deviations of elaboration and originality by source (human vs. AI) before and after the control.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. These have highlighted important areas for improving methodological transparency. We address each major comment below and have revised the manuscript to incorporate the requested details and statistics where feasible.
Point-by-point responses
Referee: [Methods] Methods (human raters paragraph): Only two trained student raters provide the ground-truth originality labels used both to fine-tune OCSAI/CLAUS and to benchmark GPT-4o. No inter-rater reliability statistic (e.g., ICC or Cohen’s κ) or rater-bias analysis is supplied, leaving open the possibility that rater idiosyncrasies are propagated into the machine scores and the subsequent bias comparison.
Authors: We agree that reporting inter-rater reliability is necessary to establish the robustness of the ground-truth labels. The revised manuscript now includes Cohen’s κ computed on the originality ratings provided by the two trained student raters, along with a short description of the intensive training protocol used to align their judgments. We also add a brief rater-bias check confirming no systematic differences in mean originality scores between the two raters. These additions directly address the concern that idiosyncrasies could have influenced the machine-learning and benchmarking results. revision: yes
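The inter-rater reliability statistic promised in this response can be sketched in a few lines. This is an illustrative implementation only, not the authors' analysis code: it assumes the two raters assign integer originality categories (e.g., 1-5) to the same items, which is the setting where Cohen's κ applies; for continuous scores an ICC would be the appropriate statistic. The rater arrays below are hypothetical.

```python
from collections import Counter

def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters labeling the same items with categories."""
    assert len(r1) == len(r2) and r1, "raters must score the same non-empty item set"
    n = len(r1)
    # Observed agreement: fraction of items where the raters match exactly.
    po = sum(a == b for a, b in zip(r1, r2)) / n
    # Chance agreement: expected matches from each rater's marginal label frequencies.
    c1, c2 = Counter(r1), Counter(r2)
    pe = sum(c1[k] * c2.get(k, 0) for k in c1) / (n * n)
    return (po - pe) / (1 - pe)

# Hypothetical 1-5 originality ratings from two trained raters.
rater_a = [3, 4, 2, 5, 3, 1, 4]
rater_b = [3, 4, 2, 4, 3, 1, 4]
kappa = cohens_kappa(rater_a, rater_b)
```

A rater-bias check of the kind the authors describe would additionally compare the two raters' mean scores (e.g., with a paired t-test) rather than their item-level agreement.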
Referee: [Results] Results (elaboration-control analysis): The claim that self-preference bias “disappeared” after controlling for elaboration is presented without (a) the precise operationalization of elaboration (word count, sentence length, lexical diversity, or a composite), (b) the regression or matching specification, or (c) any diagnostic showing that residual source differences (e.g., syntactic complexity, response formatting) are uncorrelated with originality once elaboration is partialled out. Without these details the control cannot be evaluated as sufficient to isolate self-preference.
Authors: We appreciate the request for greater specificity on the elaboration-control procedure. The revised Results section now states that elaboration was operationalized as response word count (a standard proxy in divergent-thinking research). We describe the analysis as a linear mixed-effects model with originality score as the outcome, source (human vs. AI) as the focal predictor, word count as a covariate, and random intercepts for participants. We further report post-control diagnostics: variance inflation factors below 2.0 and near-zero correlations (r < 0.10) between model residuals and additional stylistic variables such as syntactic complexity and lexical diversity. These results support that the self-preference bias is no longer detectable once elaboration is accounted for. revision: yes
Circularity Check
No circularity: purely empirical comparison with no derivations or fitted predictions
Full rationale
The paper is an empirical study comparing human and LLM raters on originality scores for AUT responses. It reports observed statistical patterns (self-preference bias vanishing after controlling for elaboration) from data analysis, without any equations, parameter fitting presented as prediction, self-citation chains, uniqueness theorems, or ansatzes. The central result is a data-driven finding, not a derivation that reduces to its inputs by construction. No load-bearing steps match the enumerated circularity patterns.