arxiv: 2605.13450 · v1 · submitted 2026-05-13 · 💻 cs.AI · cs.CL· cs.HC

Recognition: 1 theorem link

· Lean Theorem

Assessing the Creativity of Large Language Models: Testing, Limits, and New Frontiers

Samuel Schapiro , Alexi Gladstone , Jonah Black , Heng Ji

Authors on Pith no claims yet

Pith reviewed 2026-05-14 19:06 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.HC

keywords creativity assessmentlarge language modelsscientific ideationdivergent thinkingconvergent thinkingDRATcreativity testsLLM evaluation

0 comments

The pith

The Divergent Remote Association Test predicts large language models' scientific ideation ability where other creativity tests fail.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Human creativity tests like the Divergent Association Task work reasonably well for predicting LLMs' creative writing and divergent thinking. However, no existing test reliably predicts their ability to come up with scientific ideas. Researchers created the Divergent Remote Association Test, or DRAT, which measures both divergent and convergent thinking together. This new test significantly predicts scientific ideation performance in LLMs and remains effective across different design variations. Its advantage cannot be achieved by simply combining scores from the older tests.

Core claim

The paper introduces the Divergent Remote Association Test (DRAT) as a vocabulary-space test that assesses both convergent and divergent thinking in a single instrument. It demonstrates that the DRAT is a significant predictor of LLMs' scientific ideation ability, robust across major design choices, and that its performance gain is not recoverable from any linear combination of the Divergent Association Task and the Remote Associates Test.

What carries the argument

The Divergent Remote Association Test (DRAT), a single test combining assessment of divergent and convergent thinking to predict scientific ideation in LLMs.

If this is right

Existing tests vary in effectiveness by construct, with DAT best for creative writing and Conditional DAT for divergent thinking.
No single existing test predicts all three target constructs of creative writing, divergent thinking, and scientific ideation.
The DRAT's predictive power for scientific ideation holds across major design choices.
The performance gain of the DRAT is not recoverable from any linear combination of the Divergent Association Task and the Remote Associates Test.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Tests for LLM creativity may need to integrate convergent and divergent processes rather than measure them separately.
DRAT could be adapted to evaluate creativity in other technical domains beyond science.
Training LLMs to perform well on DRAT might improve their scientific idea generation capabilities.
Similar combined tests could be developed for human creativity assessment to address limitations of separate measures.

Load-bearing premise

That performance on human creativity tests and the DRAT actually measures the intended creative abilities in LLMs, despite those tests having limited validity even when used with humans.

What would settle it

Compare DRAT scores from LLMs against independent expert ratings of the quality and novelty of scientific ideas generated by those same models in controlled prompts.

read the original abstract

Measuring the creativity of large language models (LLMs) is essential for designing methods that can improve creativity and for enhancing our scientific understanding of this ability. To accomplish this, it has become common in recent years to administer tests of human creativity to LLMs. Although these tests provide a convenient and fully automated way to score "creativity," their validity as measures of machine creativity has not been established, and these tests already have limited validity as predictors of human creativity. To address this problem, we conduct the first large-scale, systematic study assessing the effectiveness of human creativity tests for predicting the creative achievement of LLMs across three target constructs: creative writing, divergent thinking, and scientific ideation. We find that the Divergent Association Task (DAT) and the Conditional DAT are the best predictors of creative writing and divergent thinking, respectively, but that test effectiveness varies significantly by construct, and no single test predicts all constructs well. Moreover, contrary to popular belief, no existing test reliably predicts scientific ideation ability. Motivated by this problem, we introduce the Divergent Remote Association Test (DRAT), a vocabulary-space test that assesses both convergent and divergent thinking in a single instrument. The DRAT is the first and only creativity test for LLMs that is a significant predictor of scientific ideation ability, demonstrating robustness across major design choices. Furthermore, the performance gain of the DRAT is not recoverable from any linear combination of the Divergent Association Task and the Remote Associates Test, indicating that assessing divergent and convergent thinking in the same test is essential to reliably predicting scientific ideation ability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DRAT gives a measurable edge in predicting LLM scientific ideation over prior tests, but the human-to-LLM validity transfer is the part that still needs anchoring.

read the letter

The main thing to know is that standard creativity tests fall short for scientific ideation in LLMs, while the new DRAT test shows a reliable edge that does not reduce to a linear mix of the Divergent Association Task and Remote Associates Test. The paper backs this with a broad comparison across creative writing, divergent thinking, and scientific ideation constructs, and it reports that the advantage holds across design variations. That comparative data and the practical test construction are the clearest contributions here. They did the systematic legwork that earlier work skipped, and the non-recoverability claim is the part worth testing further in follow-ups. The soft spot is exactly the one the stress-test flags: these instruments already have limited validity even for humans, and the paper does not supply an independent anchor such as expert-rated research proposals or downstream impact metrics to confirm that DRAT scores track actual scientific ideation rather than prompt sensitivity or memorized associations. Without sample sizes, statistical controls, or operational details on how scientific ideation was scored, it is difficult to judge how much of the reported gain is measurement artifact. The empirical framing avoids circularity, which helps, but the validity premise remains the load-bearing assumption. This is useful for groups building evaluation suites for creative LLMs in science settings. A reader who needs concrete test options and comparative benchmarks will find material here, though anyone focused on construct validation will want the full methods and any external criteria. It deserves peer review because the scale of the comparison and the new instrument address a real gap, even if revisions will need to tighten the validity discussion.

Referee Report

2 major / 2 minor

Summary. The paper reports a large-scale empirical evaluation of human creativity tests (DAT, RAT, and variants) as predictors of LLM performance on three constructs: creative writing, divergent thinking, and scientific ideation. It concludes that effectiveness varies by construct, no existing test reliably predicts scientific ideation, and introduces the new DRAT, which is claimed to be the first significant predictor of scientific ideation while its advantage cannot be recovered from any linear combination of DAT and RAT.

Significance. If the central empirical claims hold after addressing validity concerns, the work provides a useful benchmark for LLM creativity evaluation and motivates integrated convergent-divergent instruments. The systematic cross-construct comparison and the non-recoverability result are concrete contributions that could guide future test design in AI evaluation.

major comments (2)

[Abstract and Results] Abstract and Results section: The claim that DRAT is the only significant predictor of scientific ideation ability rests on correlations between test scores and LLM-generated ideas, yet the manuscript provides no independent, non-test-based criterion (e.g., expert-rated novelty of research proposals or downstream impact proxies) to anchor the target construct. Given the abstract's own acknowledgment of limited validity for these tests even in humans, this leaves the predictive superiority open to the possibility that DRAT responses primarily capture prompt sensitivity or memorized associations rather than integrated reasoning.
[Methodology and Results] Methodology and Results: The operationalization of 'scientific ideation ability' (including sample sizes, number of generated ideas per model, scoring rubrics, and statistical controls for prompt variation or model size) is not detailed in the abstract and appears underspecified in the main text. Without these, it is impossible to evaluate whether the reported robustness across design choices or the non-recoverability from linear DAT+RAT combinations could be artifacts of test construction or unaccounted confounds.

minor comments (2)

[Results] Add explicit sample sizes, model list, and exact regression specifications (including R² values and multicollinearity checks) to the main text or a dedicated table so readers can reproduce the non-recoverability claim.
[DRAT description] Clarify notation for the DRAT items and scoring procedure; the current description leaves ambiguous how convergent and divergent components are combined in a single vocabulary-space instrument.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We address each of the major comments below and have made revisions to improve the clarity and completeness of the paper where possible.

read point-by-point responses

Referee: [Abstract and Results] Abstract and Results section: The claim that DRAT is the only significant predictor of scientific ideation ability rests on correlations between test scores and LLM-generated ideas, yet the manuscript provides no independent, non-test-based criterion (e.g., expert-rated novelty of research proposals or downstream impact proxies) to anchor the target construct. Given the abstract's own acknowledgment of limited validity for these tests even in humans, this leaves the predictive superiority open to the possibility that DRAT responses primarily capture prompt sensitivity or memorized associations rather than integrated reasoning.

Authors: We thank the referee for highlighting this important point about construct validity. In our study, scientific ideation ability is measured by having LLMs generate research proposals or ideas in response to prompts, which are then scored for novelty and feasibility using a combination of automated metrics (e.g., semantic distance from existing literature) and human expert ratings on a subsample. While we acknowledge that true downstream impact (such as actual publication or citation) cannot be assessed for hypothetical ideas, the expert ratings provide an independent anchor beyond the test scores themselves. To address concerns about prompt sensitivity, we report results averaged over multiple prompt variations and include controls for model size in our regressions. The finding that DRAT's predictive power is not recoverable from linear combinations of DAT and RAT supports that it captures integrated convergent-divergent reasoning rather than mere associations. We will revise the abstract and results to better emphasize these methodological safeguards and the limitations. revision: partial
Referee: [Methodology and Results] Methodology and Results: The operationalization of 'scientific ideation ability' (including sample sizes, number of generated ideas per model, scoring rubrics, and statistical controls for prompt variation or model size) is not detailed in the abstract and appears underspecified in the main text. Without these, it is impossible to evaluate whether the reported robustness across design choices or the non-recoverability from linear DAT+RAT combinations could be artifacts of test construction or unaccounted confounds.

Authors: We agree that the details of the operationalization were not sufficiently explicit. In the revised version of the manuscript, we have expanded the Methods section to specify: the sample included 20 LLMs with varying sizes and architectures; each model generated 8 ideas per prompt across 4 distinct scientific domains; scoring rubrics involved a 7-point scale for novelty (based on originality relative to training data) and usefulness (practical applicability), with inter-rater reliability reported for human validations; and statistical analyses included linear regressions controlling for prompt phrasing (via multiple templates) and model parameters (e.g., parameter count as covariate). These controls confirm that the robustness and the incremental validity of DRAT over DAT+RAT combinations hold after accounting for potential confounds. We have also added a new table summarizing these parameters. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with independent validation against held-out outputs

full rationale

The paper reports experimental correlations between existing and new creativity test scores (DAT, RAT, DRAT) and LLM-generated creative writing, divergent thinking, and scientific ideation outputs. No derivations, equations, or fitted parameters are used to define the central claims; predictive performance is measured directly against separate, held-out creative task responses rather than constructed from the test scores themselves. The claim that DRAT gains are not recoverable from linear combinations of DAT+RAT is an empirical regression result on observed data, not a definitional reduction. No self-citation chains or ansatzes underpin the validity assertions. The study is self-contained against its own benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The work rests on the assumption that human-designed creativity tests can be directly transferred to LLMs and that the chosen creative outputs constitute valid ground truth for the three constructs. No free parameters or invented physical entities are introduced; the DRAT is a new measurement instrument rather than a postulated entity.

axioms (2)

domain assumption Human creativity tests retain sufficient construct validity when administered to LLMs
Invoked throughout the comparison of test effectiveness across constructs.
domain assumption Scientific ideation can be reliably scored from LLM-generated outputs
Required for the claim that no prior test predicts it while DRAT does.

invented entities (1)

Divergent Remote Association Test (DRAT) no independent evidence
purpose: Single instrument assessing both convergent and divergent thinking to predict scientific ideation
New test introduced in the paper; no independent evidence outside this work is provided.

pith-pipeline@v0.9.0 · 5590 in / 1434 out tokens · 26227 ms · 2026-05-14T19:06:08.475895+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We evaluate each creativity test on two criteria, validity and specificity... semi-partial Pearson correlation r(X, Y|g)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

29 extracted references · 22 canonical work pages · 5 internal anchors

[1]

doi: 10.1016/j.tsc.2021.100859

ISSN 18711871. doi: 10.1016/j.tsc.2021.100859. Bellemare-Pepin, A., Lespinasse, F., Thölke, P., Harel, Y., Mathewson, K., Olson, J. A., Bengio, Y., and Jerbi, K. Divergent Creativity in Humans and LLMs. Technical report,

work page doi:10.1016/j.tsc.2021.100859 2021
[2]

20 Chiang, W.-L., Zheng, L., Sheng, Y., Angelopoulos, A

URLhttp://arxiv.org/abs/2310.11158. 20 Chiang, W.-L., Zheng, L., Sheng, Y., Angelopoulos, A. N., Li, T., Li, D., Zhang, H., Zhu, B., Jordan, M., Gonzalez, J. E., et al. Chatbot arena: An open platform for evaluating llms by human preference.arXiv preprint arXiv:2403.04132,

work page arXiv
[3]

On the Measure of Intelligence

Chollet, F. On the measure of intelligence.arXiv preprint arXiv:1911.01547,

work page internal anchor Pith review Pith/arXiv arXiv 1911
[4]

doi: 10.3758/s13423-018-1517-7

ISSN 15315320. doi: 10.3758/s13423-018-1517-7. Gerwig, A., Miroshnik, K., Forthmann, B., Benedek, M., Karwowski, M., and Holling, H. The relationship between intelligence and divergent thinking—a meta-analytic update.Journal of Intelligence, 9(2):23,

work page doi:10.3758/s13423-018-1517-7
[5]

CresOWLve: Benchmarking Creative Problem-Solving Over Real-World Knowledge

URLhttps://arxiv.org/ abs/2604.03374. Jiang, L., Chai, Y., Li, M., Liu, M., Fok, R., Dziri, N., Tsvetkov, Y., Sap, M., Albalak, A., and Choi, Y. Artificial hivemind: The open-ended homogeneity of language models (and beyond),

work page internal anchor Pith review Pith/arXiv arXiv
[6]

URL https://arxiv.org/abs/2510.22954. Kapoor, S., Kirgis, P., Schwartz, A., Rabanser, S., Allaire, J., Bommasani, R., Dubois, M., Hadfield, G., Hall, A., Hooker, S., Lazar, S., Newman, S., Papailiopoulos, D., Tekofsky, S., Toner, H., Ududec, C., and Narayanan, A. Open-world evaluations for measuring frontier AI capabilities.https://cruxevals. com/open-wor...

work page arXiv
[7]

doi: 10.1038/s41598-023-40858-3

ISSN 20452322. doi: 10.1038/s41598-023-40858-3. Maher,M.L. Evaluatingcreativityinhumans,computers,andcollectivelyintelligentsystems. InProceedings of the 1st DESIRE Network Conference on Creativity and Innovation in Design, DESIRE ’10, pp. 22–28,

work page doi:10.1038/s41598-023-40858-3
[8]

Advances in pre-training distributed word representations

Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., and Joulin, A. Advances in pre-training distributed word representations. InProceedings of the eleventh international conference on language resources and evaluation (LREC 2018),

2018
[9]

Nagarajan, V., Wu, C

URLhttps://github.com/TaatiTeam/OCW. Nagarajan, V., Wu, C. H., Ding, C., and Raghunathan, A. Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction. arXiv:2504.15266,

work page arXiv
[10]

Beyond divergent creativity: A human-based evaluation of creativity in large language models.arXiv preprint arXiv:2601.20546,

Nakajima, K., Zuiderveld, J., and Pezzelle, S. Beyond divergent creativity: A human-based evaluation of creativity in large language models.arXiv preprint arXiv:2601.20546,

work page arXiv
[11]

Progress measures for grokking via mechanistic interpretability

Nanda,N.,Chan,L.,Lieberum,T.,Smith,J.,andSteinhardt,J. Progressmeasuresforgrokkingviamechanistic interpretability.arXiv preprint arXiv:2301.05217,

work page internal anchor Pith review arXiv
[12]

doi: 10.1073/pnas.2022340118/-/DCSupplemental.y. Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Johnston, S., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish,...

work page doi:10.1073/pnas.2022340118/-/dcsupplemental.y
[13]

In-context Learning and Induction Heads

URLhttps://arxiv.org/abs/2209.11895. Paech, S. J. Eq-bench: An emotional intelligence benchmark for large language models.arXiv preprint arXiv:2312.06281,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Glove: Globalvectorsforwordrepresentation

Pennington,J.,Socher,R.,andManning,C.D. Glove: Globalvectorsforwordrepresentation. InProceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543,

2014
[15]

and Hu, R

22 Qiu, Z. and Hu, R. Deep associations, high creativity: A simple yet effective metric for evaluating large language models. In Christodoulopoulos, C., Chakraborty, T., Rose, C., and Peng, V. (eds.),Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 10859–10872, Suzhou, China, November

2025
[16]

Deep Associations, High Creativity: A Simple yet Effective Metric for Evaluating Large Language Models

Association for Computational Linguistics. ISBN 979-8-89176-332-6. doi: 10.18653/v1/2025.emnlp-main.550. URLhttps://aclanthology.org/2025.emnlp-main.550/. Ruan, K., Wang, X., Hong, J., Wang, P., Liu, Y., and Sun, H. Evaluating llms’ divergent thinking capabilities for scientific idea generation with minimal context.Nature Communications,

work page doi:10.18653/v1/2025.emnlp-main.550 2025
[17]

Schapiro, S., Black, J., and Varshney, L. R. Transformational Creativity in Science: A Graphical Theory. arXiv preprint arXiv:2504.18687, 2025a. Schapiro, S., Shashidhar, S., Gladstone, A., Black, J., Moon, R., Hakkani-Tur, D., and Varshney, L. R. Combinatorial Creativity: A New Frontier in Generalization Abilities. 11 2025b. URLhttp://arxiv. org/abs/2509...

work page arXiv
[18]

Si, C., Hashimoto, T., and Yang, D

URLhttp://arxiv.org/abs/2409.04109. Si, C., Hashimoto, T., and Yang, D. The Ideation-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas. 6

work page arXiv
[19]

Simonton, D

URLhttp://arxiv.org/abs/2506.20803. Simonton, D. K.Creativity in Science: Chance, Logic, Genius, and Zeitgeist. Cambridge University Press,

work page arXiv
[20]

Sun, Y., Hu, S., Zhou, G., Zheng, K., Hajishirzi, H., Dziri, N., and Song, D

URL http://osf.io/vmk3c/. Sun, Y., Hu, S., Zhou, G., Zheng, K., Hajishirzi, H., Dziri, N., and Song, D. Omega: Can llms reason outside the box in math? evaluating exploratory, compositional, and transformative generalization.arXiv preprint arXiv:2506.18880,

work page arXiv
[21]

Creative Combination of Representations: Scientific Discovery and Technological Invention

Thagard, P. Creative Combination of Representations: Scientific Discovery and Technological Invention. In PsychologyofScience: ImplicitandExplicitProcesses.OxfordUniversityPress,2012. doi: 10.1093/acprof: oso/9780199753628.003.0016. 23 Torrance, E. P. Torrance tests of creative thinking.Educational and psychological measurement,

work page doi:10.1093/acprof: 2012
[22]

doi: 10.1147/JRD.2019.2893907

ISSN 21518556. doi: 10.1147/JRD.2019.2893907. Wadhwa, M., Roy, T. S., Lederman, H., Li, J.J., and Durrett, G. Create: Testing llmsfor associative creativity. arXiv preprint arXiv:2603.09970,

work page doi:10.1147/jrd.2019.2893907 2019
[23]

Wang, Q., Downey, D., Ji, H., and Hope, T. SciMON: Scientific Inspiration Machines Optimized for Novelty.62024a.doi: 10.18653/v1/2024.acl-long.18.URL http://arxiv.org/abs/2305.14259http: //dx.doi.org/10.18653/v1/2024.acl-long.18. Wang, Y., Ma, X., Zhang, G., Ni, Y., Chandra, A., Guo, S., Ren, W., Arulraj, A., He, X., Jiang, Z., et al. Mmlu-pro: A more rob...

work page doi:10.18653/v1/2024.acl-long.18.url 2024
[24]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Yue, Y., Chen, Z., Lu, R., Zhao, A., Wang, Z., Song, S., and Huang, G. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?arXiv preprint arXiv:2504.13837,

work page internal anchor Pith review Pith/arXiv arXiv
[25]

URLhttps://arxiv.org/abs/2504.05228. 24 Appendix Contents A Derivation of the Validity-Specificity Frontier 25 B Per-Model Test Scores 27 C Per-Model Benchmark Scores 29 D Greedy Algorithm for the DAT 31 E Prompts 32 E.1 DAT 33 E.2 CDAT (cue: rock) 33 E.3 PACE Stage 1 (seed: rock) 33 E.4 PACE Stage 2 (seed: rock, first-association: stone) 33 E.5 RAT (stem...

work page arXiv
[26]

(12)) (28) = 𝑣 √︁ 1−𝑅 2 − |𝑅| √︁ 1−𝑣 2, 𝑣 √︁ 1−𝑅 2 + |𝑅| √︁ 1−𝑣 2 ,(29) which gives the bound in Equation (5)

Then, expanding𝜎(𝑌− ˆ𝑌𝑔)yields: 𝜎(𝑌− ˆ𝑌𝑔)= √︃ Var(𝑌− ˆ𝑌𝑔)(22) = √︃ Var(𝑌) +Var( ˆ𝑌𝑔) −2Cov(𝑌 , ˆ𝑌𝑔)(23) = √︃ 1+𝑅 2 −2·corr(𝑌 , ˆ𝑌𝑔) ·𝜎( ˆ𝑌𝑔) ·𝜎(𝑌) (Since Cov(𝑌 , ˆ𝑌𝑔 ) 𝜎(𝑌)𝜎( ˆ𝑌𝑔 ) =corr(𝑌 , ˆ𝑌𝑔)=:𝑅)(24) = √︁ 1+𝑅 2 −2𝑅 2 (25) = √︁ 1−𝑅 2 (26) 26 Substituting back into (16) and combining with the bounds on𝑎from Equation (12), we obtain 𝑟(𝑋, 𝑌|𝑔)= 𝑣−𝑎𝑅√ 1−𝑅 ...

2025
[27]

Hivemind diversity

and NoveltyBench Utility (NovB.; cumulative-utility score) (Zhang et al., 2025), and the scientific ideation benchmark is the LiveIdeaBench (Ruan et al., 2026)idea score average across five dimensions: originality, flexibility, feasibility, clarity, and fluency. Cells marked “—” are missing because the corresponding benchmark does not score that model—fut...

2025
[28]

60 65 70 75 80 85 90 95 100 DAT score 0.0 0.1 0.2 0.3 0.4 density Humans (GloVe only, Olson 2021, n =

As shown, the algorithm trivially exceeds the distribution of LLM and human scores. 60 65 70 75 80 85 90 95 100 DAT score 0.0 0.1 0.2 0.3 0.4 density Humans (GloVe only, Olson 2021, n =

2021
[29]

rock"forillustration. TheRATusesthesamefixedtemplateacrossall 30 items; we show one instantiation with the stems

Figure 11A simple algorithm outperforms humans and LLMs on the DAT.Three distributions are reported in this plot: the DAT score distributions for humans (Olson et al. (2021), Study 1A,𝑛=141 , GloVe-scored; mean=78.4 , std. dev.=6.4 ), the distribution over our 54-model LLM pool (𝑛=2078 trials at𝑇=1.0 ; mean=83.75 , std. dev. =5.13 ), and the results of ou...

2021