pith. machine review for the scientific record. sign in

arxiv: 2605.13450 · v1 · submitted 2026-05-13 · 💻 cs.AI · cs.CL· cs.HC

Recognition: 1 theorem link

· Lean Theorem

Assessing the Creativity of Large Language Models: Testing, Limits, and New Frontiers

Authors on Pith no claims yet

Pith reviewed 2026-05-14 19:06 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.HC
keywords creativity assessmentlarge language modelsscientific ideationdivergent thinkingconvergent thinkingDRATcreativity testsLLM evaluation
0
0 comments X

The pith

The Divergent Remote Association Test predicts large language models' scientific ideation ability where other creativity tests fail.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Human creativity tests like the Divergent Association Task work reasonably well for predicting LLMs' creative writing and divergent thinking. However, no existing test reliably predicts their ability to come up with scientific ideas. Researchers created the Divergent Remote Association Test, or DRAT, which measures both divergent and convergent thinking together. This new test significantly predicts scientific ideation performance in LLMs and remains effective across different design variations. Its advantage cannot be achieved by simply combining scores from the older tests.

Core claim

The paper introduces the Divergent Remote Association Test (DRAT) as a vocabulary-space test that assesses both convergent and divergent thinking in a single instrument. It demonstrates that the DRAT is a significant predictor of LLMs' scientific ideation ability, robust across major design choices, and that its performance gain is not recoverable from any linear combination of the Divergent Association Task and the Remote Associates Test.

What carries the argument

The Divergent Remote Association Test (DRAT), a single test combining assessment of divergent and convergent thinking to predict scientific ideation in LLMs.

If this is right

  • Existing tests vary in effectiveness by construct, with DAT best for creative writing and Conditional DAT for divergent thinking.
  • No single existing test predicts all three target constructs of creative writing, divergent thinking, and scientific ideation.
  • The DRAT's predictive power for scientific ideation holds across major design choices.
  • The performance gain of the DRAT is not recoverable from any linear combination of the Divergent Association Task and the Remote Associates Test.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Tests for LLM creativity may need to integrate convergent and divergent processes rather than measure them separately.
  • DRAT could be adapted to evaluate creativity in other technical domains beyond science.
  • Training LLMs to perform well on DRAT might improve their scientific idea generation capabilities.
  • Similar combined tests could be developed for human creativity assessment to address limitations of separate measures.

Load-bearing premise

That performance on human creativity tests and the DRAT actually measures the intended creative abilities in LLMs, despite those tests having limited validity even when used with humans.

What would settle it

Compare DRAT scores from LLMs against independent expert ratings of the quality and novelty of scientific ideas generated by those same models in controlled prompts.

read the original abstract

Measuring the creativity of large language models (LLMs) is essential for designing methods that can improve creativity and for enhancing our scientific understanding of this ability. To accomplish this, it has become common in recent years to administer tests of human creativity to LLMs. Although these tests provide a convenient and fully automated way to score "creativity," their validity as measures of machine creativity has not been established, and these tests already have limited validity as predictors of human creativity. To address this problem, we conduct the first large-scale, systematic study assessing the effectiveness of human creativity tests for predicting the creative achievement of LLMs across three target constructs: creative writing, divergent thinking, and scientific ideation. We find that the Divergent Association Task (DAT) and the Conditional DAT are the best predictors of creative writing and divergent thinking, respectively, but that test effectiveness varies significantly by construct, and no single test predicts all constructs well. Moreover, contrary to popular belief, no existing test reliably predicts scientific ideation ability. Motivated by this problem, we introduce the Divergent Remote Association Test (DRAT), a vocabulary-space test that assesses both convergent and divergent thinking in a single instrument. The DRAT is the first and only creativity test for LLMs that is a significant predictor of scientific ideation ability, demonstrating robustness across major design choices. Furthermore, the performance gain of the DRAT is not recoverable from any linear combination of the Divergent Association Task and the Remote Associates Test, indicating that assessing divergent and convergent thinking in the same test is essential to reliably predicting scientific ideation ability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper reports a large-scale empirical evaluation of human creativity tests (DAT, RAT, and variants) as predictors of LLM performance on three constructs: creative writing, divergent thinking, and scientific ideation. It concludes that effectiveness varies by construct, no existing test reliably predicts scientific ideation, and introduces the new DRAT, which is claimed to be the first significant predictor of scientific ideation while its advantage cannot be recovered from any linear combination of DAT and RAT.

Significance. If the central empirical claims hold after addressing validity concerns, the work provides a useful benchmark for LLM creativity evaluation and motivates integrated convergent-divergent instruments. The systematic cross-construct comparison and the non-recoverability result are concrete contributions that could guide future test design in AI evaluation.

major comments (2)
  1. [Abstract and Results] Abstract and Results section: The claim that DRAT is the only significant predictor of scientific ideation ability rests on correlations between test scores and LLM-generated ideas, yet the manuscript provides no independent, non-test-based criterion (e.g., expert-rated novelty of research proposals or downstream impact proxies) to anchor the target construct. Given the abstract's own acknowledgment of limited validity for these tests even in humans, this leaves the predictive superiority open to the possibility that DRAT responses primarily capture prompt sensitivity or memorized associations rather than integrated reasoning.
  2. [Methodology and Results] Methodology and Results: The operationalization of 'scientific ideation ability' (including sample sizes, number of generated ideas per model, scoring rubrics, and statistical controls for prompt variation or model size) is not detailed in the abstract and appears underspecified in the main text. Without these, it is impossible to evaluate whether the reported robustness across design choices or the non-recoverability from linear DAT+RAT combinations could be artifacts of test construction or unaccounted confounds.
minor comments (2)
  1. [Results] Add explicit sample sizes, model list, and exact regression specifications (including R² values and multicollinearity checks) to the main text or a dedicated table so readers can reproduce the non-recoverability claim.
  2. [DRAT description] Clarify notation for the DRAT items and scoring procedure; the current description leaves ambiguous how convergent and divergent components are combined in a single vocabulary-space instrument.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We address each of the major comments below and have made revisions to improve the clarity and completeness of the paper where possible.

read point-by-point responses
  1. Referee: [Abstract and Results] Abstract and Results section: The claim that DRAT is the only significant predictor of scientific ideation ability rests on correlations between test scores and LLM-generated ideas, yet the manuscript provides no independent, non-test-based criterion (e.g., expert-rated novelty of research proposals or downstream impact proxies) to anchor the target construct. Given the abstract's own acknowledgment of limited validity for these tests even in humans, this leaves the predictive superiority open to the possibility that DRAT responses primarily capture prompt sensitivity or memorized associations rather than integrated reasoning.

    Authors: We thank the referee for highlighting this important point about construct validity. In our study, scientific ideation ability is measured by having LLMs generate research proposals or ideas in response to prompts, which are then scored for novelty and feasibility using a combination of automated metrics (e.g., semantic distance from existing literature) and human expert ratings on a subsample. While we acknowledge that true downstream impact (such as actual publication or citation) cannot be assessed for hypothetical ideas, the expert ratings provide an independent anchor beyond the test scores themselves. To address concerns about prompt sensitivity, we report results averaged over multiple prompt variations and include controls for model size in our regressions. The finding that DRAT's predictive power is not recoverable from linear combinations of DAT and RAT supports that it captures integrated convergent-divergent reasoning rather than mere associations. We will revise the abstract and results to better emphasize these methodological safeguards and the limitations. revision: partial

  2. Referee: [Methodology and Results] Methodology and Results: The operationalization of 'scientific ideation ability' (including sample sizes, number of generated ideas per model, scoring rubrics, and statistical controls for prompt variation or model size) is not detailed in the abstract and appears underspecified in the main text. Without these, it is impossible to evaluate whether the reported robustness across design choices or the non-recoverability from linear DAT+RAT combinations could be artifacts of test construction or unaccounted confounds.

    Authors: We agree that the details of the operationalization were not sufficiently explicit. In the revised version of the manuscript, we have expanded the Methods section to specify: the sample included 20 LLMs with varying sizes and architectures; each model generated 8 ideas per prompt across 4 distinct scientific domains; scoring rubrics involved a 7-point scale for novelty (based on originality relative to training data) and usefulness (practical applicability), with inter-rater reliability reported for human validations; and statistical analyses included linear regressions controlling for prompt phrasing (via multiple templates) and model parameters (e.g., parameter count as covariate). These controls confirm that the robustness and the incremental validity of DRAT over DAT+RAT combinations hold after accounting for potential confounds. We have also added a new table summarizing these parameters. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with independent validation against held-out outputs

full rationale

The paper reports experimental correlations between existing and new creativity test scores (DAT, RAT, DRAT) and LLM-generated creative writing, divergent thinking, and scientific ideation outputs. No derivations, equations, or fitted parameters are used to define the central claims; predictive performance is measured directly against separate, held-out creative task responses rather than constructed from the test scores themselves. The claim that DRAT gains are not recoverable from linear combinations of DAT+RAT is an empirical regression result on observed data, not a definitional reduction. No self-citation chains or ansatzes underpin the validity assertions. The study is self-contained against its own benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The work rests on the assumption that human-designed creativity tests can be directly transferred to LLMs and that the chosen creative outputs constitute valid ground truth for the three constructs. No free parameters or invented physical entities are introduced; the DRAT is a new measurement instrument rather than a postulated entity.

axioms (2)
  • domain assumption Human creativity tests retain sufficient construct validity when administered to LLMs
    Invoked throughout the comparison of test effectiveness across constructs.
  • domain assumption Scientific ideation can be reliably scored from LLM-generated outputs
    Required for the claim that no prior test predicts it while DRAT does.
invented entities (1)
  • Divergent Remote Association Test (DRAT) no independent evidence
    purpose: Single instrument assessing both convergent and divergent thinking to predict scientific ideation
    New test introduced in the paper; no independent evidence outside this work is provided.

pith-pipeline@v0.9.0 · 5590 in / 1434 out tokens · 26227 ms · 2026-05-14T19:06:08.475895+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

29 extracted references · 22 canonical work pages · 5 internal anchors

  1. [1]

    doi: 10.1016/j.tsc.2021.100859

    ISSN 18711871. doi: 10.1016/j.tsc.2021.100859. Bellemare-Pepin, A., Lespinasse, F., Thölke, P., Harel, Y., Mathewson, K., Olson, J. A., Bengio, Y., and Jerbi, K. Divergent Creativity in Humans and LLMs. Technical report,

  2. [2]

    20 Chiang, W.-L., Zheng, L., Sheng, Y., Angelopoulos, A

    URLhttp://arxiv.org/abs/2310.11158. 20 Chiang, W.-L., Zheng, L., Sheng, Y., Angelopoulos, A. N., Li, T., Li, D., Zhang, H., Zhu, B., Jordan, M., Gonzalez, J. E., et al. Chatbot arena: An open platform for evaluating llms by human preference.arXiv preprint arXiv:2403.04132,

  3. [3]

    On the Measure of Intelligence

    Chollet, F. On the measure of intelligence.arXiv preprint arXiv:1911.01547,

  4. [4]

    doi: 10.3758/s13423-018-1517-7

    ISSN 15315320. doi: 10.3758/s13423-018-1517-7. Gerwig, A., Miroshnik, K., Forthmann, B., Benedek, M., Karwowski, M., and Holling, H. The relationship between intelligence and divergent thinking—a meta-analytic update.Journal of Intelligence, 9(2):23,

  5. [5]

    CresOWLve: Benchmarking Creative Problem-Solving Over Real-World Knowledge

    URLhttps://arxiv.org/ abs/2604.03374. Jiang, L., Chai, Y., Li, M., Liu, M., Fok, R., Dziri, N., Tsvetkov, Y., Sap, M., Albalak, A., and Choi, Y. Artificial hivemind: The open-ended homogeneity of language models (and beyond),

  6. [6]

    URL https://arxiv.org/abs/2510.22954. Kapoor, S., Kirgis, P., Schwartz, A., Rabanser, S., Allaire, J., Bommasani, R., Dubois, M., Hadfield, G., Hall, A., Hooker, S., Lazar, S., Newman, S., Papailiopoulos, D., Tekofsky, S., Toner, H., Ududec, C., and Narayanan, A. Open-world evaluations for measuring frontier AI capabilities.https://cruxevals. com/open-wor...

  7. [7]

    doi: 10.1038/s41598-023-40858-3

    ISSN 20452322. doi: 10.1038/s41598-023-40858-3. Maher,M.L. Evaluatingcreativityinhumans,computers,andcollectivelyintelligentsystems. InProceedings of the 1st DESIRE Network Conference on Creativity and Innovation in Design, DESIRE ’10, pp. 22–28,

  8. [8]

    Advances in pre-training distributed word representations

    Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., and Joulin, A. Advances in pre-training distributed word representations. InProceedings of the eleventh international conference on language resources and evaluation (LREC 2018),

  9. [9]

    Nagarajan, V., Wu, C

    URLhttps://github.com/TaatiTeam/OCW. Nagarajan, V., Wu, C. H., Ding, C., and Raghunathan, A. Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction. arXiv:2504.15266,

  10. [10]

    Beyond divergent creativity: A human-based evaluation of creativity in large language models.arXiv preprint arXiv:2601.20546,

    Nakajima, K., Zuiderveld, J., and Pezzelle, S. Beyond divergent creativity: A human-based evaluation of creativity in large language models.arXiv preprint arXiv:2601.20546,

  11. [11]

    Progress measures for grokking via mechanistic interpretability

    Nanda,N.,Chan,L.,Lieberum,T.,Smith,J.,andSteinhardt,J. Progressmeasuresforgrokkingviamechanistic interpretability.arXiv preprint arXiv:2301.05217,

  12. [12]

    doi: 10.1073/pnas.2022340118/-/DCSupplemental.y. Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Johnston, S., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish,...

  13. [13]

    In-context Learning and Induction Heads

    URLhttps://arxiv.org/abs/2209.11895. Paech, S. J. Eq-bench: An emotional intelligence benchmark for large language models.arXiv preprint arXiv:2312.06281,

  14. [14]

    Glove: Globalvectorsforwordrepresentation

    Pennington,J.,Socher,R.,andManning,C.D. Glove: Globalvectorsforwordrepresentation. InProceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543,

  15. [15]

    and Hu, R

    22 Qiu, Z. and Hu, R. Deep associations, high creativity: A simple yet effective metric for evaluating large language models. In Christodoulopoulos, C., Chakraborty, T., Rose, C., and Peng, V. (eds.),Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 10859–10872, Suzhou, China, November

  16. [16]

    Deep Associations, High Creativity: A Simple yet Effective Metric for Evaluating Large Language Models

    Association for Computational Linguistics. ISBN 979-8-89176-332-6. doi: 10.18653/v1/2025.emnlp-main.550. URLhttps://aclanthology.org/2025.emnlp-main.550/. Ruan, K., Wang, X., Hong, J., Wang, P., Liu, Y., and Sun, H. Evaluating llms’ divergent thinking capabilities for scientific idea generation with minimal context.Nature Communications,

  17. [17]

    Schapiro, S., Black, J., and Varshney, L. R. Transformational Creativity in Science: A Graphical Theory. arXiv preprint arXiv:2504.18687, 2025a. Schapiro, S., Shashidhar, S., Gladstone, A., Black, J., Moon, R., Hakkani-Tur, D., and Varshney, L. R. Combinatorial Creativity: A New Frontier in Generalization Abilities. 11 2025b. URLhttp://arxiv. org/abs/2509...

  18. [18]

    Si, C., Hashimoto, T., and Yang, D

    URLhttp://arxiv.org/abs/2409.04109. Si, C., Hashimoto, T., and Yang, D. The Ideation-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas. 6

  19. [19]

    Simonton, D

    URLhttp://arxiv.org/abs/2506.20803. Simonton, D. K.Creativity in Science: Chance, Logic, Genius, and Zeitgeist. Cambridge University Press,

  20. [20]

    Sun, Y., Hu, S., Zhou, G., Zheng, K., Hajishirzi, H., Dziri, N., and Song, D

    URL http://osf.io/vmk3c/. Sun, Y., Hu, S., Zhou, G., Zheng, K., Hajishirzi, H., Dziri, N., and Song, D. Omega: Can llms reason outside the box in math? evaluating exploratory, compositional, and transformative generalization.arXiv preprint arXiv:2506.18880,

  21. [21]

    Creative Combination of Representations: Scientific Discovery and Technological Invention

    Thagard, P. Creative Combination of Representations: Scientific Discovery and Technological Invention. In PsychologyofScience: ImplicitandExplicitProcesses.OxfordUniversityPress,2012. doi: 10.1093/acprof: oso/9780199753628.003.0016. 23 Torrance, E. P. Torrance tests of creative thinking.Educational and psychological measurement,

  22. [22]

    doi: 10.1147/JRD.2019.2893907

    ISSN 21518556. doi: 10.1147/JRD.2019.2893907. Wadhwa, M., Roy, T. S., Lederman, H., Li, J.J., and Durrett, G. Create: Testing llmsfor associative creativity. arXiv preprint arXiv:2603.09970,

  23. [23]

    Wang, Q., Downey, D., Ji, H., and Hope, T. SciMON: Scientific Inspiration Machines Optimized for Novelty.62024a.doi: 10.18653/v1/2024.acl-long.18.URL http://arxiv.org/abs/2305.14259http: //dx.doi.org/10.18653/v1/2024.acl-long.18. Wang, Y., Ma, X., Zhang, G., Ni, Y., Chandra, A., Guo, S., Ren, W., Arulraj, A., He, X., Jiang, Z., et al. Mmlu-pro: A more rob...

  24. [24]

    Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

    Yue, Y., Chen, Z., Lu, R., Zhao, A., Wang, Z., Song, S., and Huang, G. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?arXiv preprint arXiv:2504.13837,

  25. [25]

    URLhttps://arxiv.org/abs/2504.05228. 24 Appendix Contents A Derivation of the Validity-Specificity Frontier 25 B Per-Model Test Scores 27 C Per-Model Benchmark Scores 29 D Greedy Algorithm for the DAT 31 E Prompts 32 E.1 DAT 33 E.2 CDAT (cue: rock) 33 E.3 PACE Stage 1 (seed: rock) 33 E.4 PACE Stage 2 (seed: rock, first-association: stone) 33 E.5 RAT (stem...

  26. [26]

    (12)) (28) = 𝑣 √︁ 1−𝑅 2 − |𝑅| √︁ 1−𝑣 2, 𝑣 √︁ 1−𝑅 2 + |𝑅| √︁ 1−𝑣 2 ,(29) which gives the bound in Equation (5)

    Then, expanding𝜎(𝑌− ˆ𝑌𝑔)yields: 𝜎(𝑌− ˆ𝑌𝑔)= √︃ Var(𝑌− ˆ𝑌𝑔)(22) = √︃ Var(𝑌) +Var( ˆ𝑌𝑔) −2Cov(𝑌 , ˆ𝑌𝑔)(23) = √︃ 1+𝑅 2 −2·corr(𝑌 , ˆ𝑌𝑔) ·𝜎( ˆ𝑌𝑔) ·𝜎(𝑌) (Since Cov(𝑌 , ˆ𝑌𝑔 ) 𝜎(𝑌)𝜎( ˆ𝑌𝑔 ) =corr(𝑌 , ˆ𝑌𝑔)=:𝑅)(24) = √︁ 1+𝑅 2 −2𝑅 2 (25) = √︁ 1−𝑅 2 (26) 26 Substituting back into (16) and combining with the bounds on𝑎from Equation (12), we obtain 𝑟(𝑋, 𝑌|𝑔)= 𝑣−𝑎𝑅√ 1−𝑅 ...

  27. [27]

    Hivemind diversity

    and NoveltyBench Utility (NovB.; cumulative-utility score) (Zhang et al., 2025), and the scientific ideation benchmark is the LiveIdeaBench (Ruan et al., 2026)idea score average across five dimensions: originality, flexibility, feasibility, clarity, and fluency. Cells marked “—” are missing because the corresponding benchmark does not score that model—fut...

  28. [28]

    60 65 70 75 80 85 90 95 100 DAT score 0.0 0.1 0.2 0.3 0.4 density Humans (GloVe only, Olson 2021, n =

    As shown, the algorithm trivially exceeds the distribution of LLM and human scores. 60 65 70 75 80 85 90 95 100 DAT score 0.0 0.1 0.2 0.3 0.4 density Humans (GloVe only, Olson 2021, n =

  29. [29]

    rock"forillustration. TheRATusesthesamefixedtemplateacrossall 30 items; we show one instantiation with the stems

    Figure 11A simple algorithm outperforms humans and LLMs on the DAT.Three distributions are reported in this plot: the DAT score distributions for humans (Olson et al. (2021), Study 1A,𝑛=141 , GloVe-scored; mean=78.4 , std. dev.=6.4 ), the distribution over our 54-model LLM pool (𝑛=2078 trials at𝑇=1.0 ; mean=83.75 , std. dev. =5.13 ), and the results of ou...