Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact

David Garcia; Dirk U. Wulff; Jelena Meyer

arxiv: 2606.20205 · v1 · pith:VQLGUZ4Dnew · submitted 2026-06-18 · 💻 cs.AI · cs.CL · cs.HC


Pith reviewed 2026-06-26 17:31 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.HC

keywords large language models · psychological profiles · measurement artifact · directional response bias · psychometric framework · response orthogonality · variance decomposition


The pith

Apparent psychological profiles of large language models arise mostly from directional response bias rather than the traits the tests target.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper administers personality and risk-preference instruments to 56 instruction-tuned LLMs and to human reference samples, then applies a formal psychometric variance decomposition. It finds that between-model differences are driven by a directional response bias toward one end of the scale or one labeled option; this bias accounts for 81-90% of between-model variation, against 9-16% in humans. The bias declines with model capability but does not disappear. Because bias dominates, an instrument's apparent reliability is predicted almost entirely by its response orthogonality (the share of items for which trait and bias point in opposite directions), and a model's profile shifts with the item set or can be manufactured by choosing items. The paper concludes that borrowed human instruments produce artifacts rather than valid LLM traits.

Core claim

Using a battery of self-report and behavioral instruments, the authors show that differences between LLMs are driven not by targeted traits but by directional response bias; variance decomposition attributes 81-90% of between-model variation to this bias. The bias declines with capability but is not eliminated. Instrument reliability is almost entirely predicted by response orthogonality, and profiles shift with item selection or can be manufactured through it.

What carries the argument

Directional response bias (tendency to respond toward one end of the scale regardless of item content), isolated through variance decomposition within a formal psychometric framework applied to self-report and behavioral tasks.
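To make the idea of a variance decomposition concrete, here is a minimal sketch in Python. It is not the paper's method: it assumes a simple additive model in which each simulated response is the item's keying times a model-level trait, plus a model-level bias, plus noise. The function names (`bias_share`, `simulate`), the noise level, and the standard deviations are all illustrative assumptions.

```python
import numpy as np

def bias_share(responses, keys):
    """Share of between-model variance carried by the bias term.

    responses: (n_models, n_items) raw responses, centered on the scale midpoint.
    keys: (n_items,) item keying, +1 if agreeing signals more of the trait, -1 otherwise.
    Assumes an additive model; this is an illustration, not the paper's estimator.
    """
    keys = np.asarray(keys, dtype=float)
    # Per model, regress raw responses on [1, key]:
    # the intercept estimates directional bias, the slope estimates the trait.
    X = np.column_stack([np.ones_like(keys), keys])
    coef, *_ = np.linalg.lstsq(X, np.asarray(responses, dtype=float).T, rcond=None)
    bias, trait = coef[0], coef[1]
    var_bias, var_trait = bias.var(ddof=1), trait.var(ddof=1)
    return var_bias / (var_bias + var_trait)

def simulate(sd_trait, sd_bias, n_models=56, n_items=40, seed=0):
    """Synthetic responders under the assumed additive model (hypothetical values)."""
    rng = np.random.default_rng(seed)
    keys = np.tile([1.0, -1.0], n_items // 2)          # balanced keying
    trait = rng.normal(0, sd_trait, n_models)
    bias = rng.normal(0, sd_bias, n_models)
    noise = rng.normal(0, 0.5, (n_models, n_items))
    responses = trait[:, None] * keys[None, :] + bias[:, None] + noise
    return responses, keys

bias_driven = bias_share(*simulate(sd_trait=0.3, sd_bias=1.0))
trait_driven = bias_share(*simulate(sd_trait=1.0, sd_bias=0.3))
print(f"bias-driven population:  bias share = {bias_driven:.2f}")
print(f"trait-driven population: bias share = {trait_driven:.2f}")
```

In this toy setup the bias share is large when models differ mainly in where they sit on the scale and small when they differ mainly in the trait. The split is only as trustworthy as the additive assumption behind it, and separating the two terms requires items keyed in both directions.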

If this is right

Apparent reliability of any instrument for LLMs is almost entirely predicted by its response orthogonality.
Model profiles shift with the items used and can be manufactured through item selection.
The bias declines with model capability but is not eliminated by it.
Instruments borrowed from human psychology rarely achieve full orthogonality and may lack validity for LLMs.
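The link between item keying and apparent reliability can be illustrated with a small simulation. The sketch below assumes responders who have no trait at all, only a directional bias toward the "agree" end, and computes Cronbach's alpha after reverse-scoring. The operationalization of response orthogonality here (the fraction of items whose keyed direction opposes the bias direction) follows the verbal definition above, but the function names and all numeric settings are assumptions for illustration.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a (n_respondents, n_items) matrix of scored items."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_var = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

def response_orthogonality(keys, bias_direction=1):
    """Fraction of items whose keyed (trait) direction opposes the bias direction."""
    return float(np.mean(np.asarray(keys) != bias_direction))

def alpha_under_pure_bias(n_opposed, n_items=20, n_models=56, seed=0):
    """Alpha for simulated responders with bias but no trait (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    keys = np.ones(n_items)
    keys[:n_opposed] = -1                       # items where trait and bias oppose
    bias = rng.normal(0, 1.0, n_models)         # each model's pull toward "agree"
    raw = bias[:, None] + rng.normal(0, 0.5, (n_models, n_items))
    scored = raw * keys                         # reverse-score on a centered scale
    return response_orthogonality(keys), cronbach_alpha(scored)

results = {n: alpha_under_pure_bias(n) for n in (0, 5, 10, 15, 20)}
for n, (orth, alpha) in results.items():
    print(f"opposed items = {n:2d}  orthogonality = {orth:.2f}  alpha = {alpha:6.2f}")
```

In this toy model alpha is high when every item is keyed the same way relative to the bias and collapses when keying is balanced, even though no trait is present. That is the sense in which keying composition alone can drive apparent reliability.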

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Dedicated LLM assessments should be built around response orthogonality rather than direct adoption of human scales.
Using LLMs as proxies for human participants in psychological research may introduce systematic artifacts traceable to response bias.
Similar directional biases could affect other survey-style evaluations of AI systems beyond personality and risk measures.

Load-bearing premise

Directional response bias can be cleanly separated from the targeted trait effects by the chosen instruments and the variance decomposition method.

What would settle it

A replication in which the same variance decomposition on new LLMs and instruments attributes most between-model variation to the intended traits rather than directional bias, or in which model profiles remain stable when items are swapped for others measuring the same trait.

Original abstract

Psychological instruments designed for humans are increasingly used to assign large language models (LLMs) stable psychological profiles that affect their usability, safety assessment, and use as proxies for human participants in research. Using a formal psychometric framework, we show that these profiles are largely a measurement artifact. Administering a battery of personality and risk-preference instruments spanning self-reports and behavioral tasks to 56 instruction-tuned LLMs alongside large human reference samples, we report four findings. First, differences between models are driven not by the traits an instrument targets but by a directional response bias, a tendency to respond toward one end of the scale, or one labeled option, regardless of item content; a variance decomposition attributes 81-90% of between-model variation to this bias, against 9-16% in humans. Second, the bias declines with model capability but is not eliminated by it. Third, because bias rather than trait drives responding, an instrument's apparent reliability is almost entirely predicted by its response orthogonality, a term we coin for the proportion of items for which trait and bias point in opposite directions. Fourth, the profile a model appears to have shifts with the items used and can be manufactured through item selection. These results demonstrate that the apparent psychological profiles of LLMs are artifacts of the instrument used to measure them, not properties of the models themselves. As instruments borrowed from human psychology are rarely fully orthogonal and may inherently lack validity for LLMs, we call for dedicated assessments centered on response orthogonality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's core result is that directional response bias drives 81-90% of between-model differences on standard psych instruments, far above the human baseline, but the clean isolation of that bias from trait effects is the assumption most open to question for LLMs.


The main point here is that when you give LLMs personality and risk tests, the differences you see across models mostly trace to a simple tendency to pick one end of the scale or one labeled option, not to the actual traits the items target. The variance split they report (81-90% bias for models, 9-16% for humans) is the striking number, and they back it with data from 56 models plus human samples.

What stands out as new is the variance decomposition itself plus the response orthogonality measure they introduce to predict when an instrument will look reliable. The finding that you can shift a model's apparent profile just by picking different items is also useful and directly challenges the practice of treating these scores as stable model properties.
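The item-selection point can be shown with a few lines of simulation. The sketch below assumes a single responder with no trait, only a lean toward "agree" on a scale centered at zero, and scores three hypothetical forms built from the same item pool. The helper `trait_score` and all numbers are illustrative assumptions.

```python
import numpy as np

def trait_score(raw, keys, selected):
    """Mean keyed score over the selected items (reverse-scoring = sign flip)."""
    raw = np.asarray(raw, dtype=float)
    keys = np.asarray(keys, dtype=float)
    idx = np.asarray(selected)
    return float(np.mean(raw[idx] * keys[idx]))

rng = np.random.default_rng(0)
n_items = 40
keys = np.tile([1.0, -1.0], n_items // 2)       # half positively, half negatively keyed

# One simulated responder: leans toward "agree" (+1 on a -2..2 scale), no trait signal.
raw = np.clip(1.0 + rng.normal(0, 0.5, n_items), -2, 2)

positive_items = np.flatnonzero(keys > 0)
negative_items = np.flatnonzero(keys < 0)

full_form = trait_score(raw, keys, np.arange(n_items))
only_pos = trait_score(raw, keys, positive_items)
only_neg = trait_score(raw, keys, negative_items)

print(f"balanced form:             {full_form:+.2f}")
print(f"positively keyed items:    {only_pos:+.2f}")
print(f"negatively keyed items:    {only_neg:+.2f}")
```

The same responses yield a high, a low, or a near-neutral trait score depending only on which items are kept, which is the mechanism by which a profile could be manufactured under bias-driven responding.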

The work does a few things right. It includes human reference data for comparison, covers both self-report and behavioral tasks, and gives concrete percentages rather than vague claims. The call for instruments built around orthogonality follows logically from their results.

The soft spot is the isolation step. The decomposition treats bias as additive and independent of item content, but LLMs generate answers from full token sequences where semantics, scale labels, and training can create correlations the method might fold into the bias term. The paper does not appear to report explicit checks for content-by-bias interactions, so the 81-90% figure rests on an assumption that may not hold as cleanly for next-token models as it does for humans. Without the full methods and item lists it is hard to judge how robust that partition is.

This paper is aimed at anyone using off-the-shelf psych tests on LLMs for safety, usability, or research proxy work. It deserves a serious referee because the empirical pattern is specific enough to test and the practical implication is large if the decomposition holds up.

Referee Report

2 major / 2 minor

Summary. The paper administers a battery of personality and risk-preference instruments (self-report and behavioral) to 56 instruction-tuned LLMs and human reference samples. It reports four findings: (1) between-model differences are driven primarily by directional response bias (tendency toward one scale end or labeled option, independent of item content), with a variance decomposition attributing 81-90% of between-model variation to this bias versus 9-16% in humans; (2) the bias declines with model capability but is not eliminated; (3) instrument reliability is almost entirely predicted by 'response orthogonality' (proportion of items where trait and bias oppose); (4) apparent profiles shift with item selection and can be manufactured. The conclusion is that LLM psychological profiles are measurement artifacts, not model properties, and calls for new assessments centered on response orthogonality.

Significance. If the variance decomposition and orthogonality results hold under scrutiny, the work would substantially undermine the use of borrowed human psychological instruments for profiling LLMs in usability, safety, and research-proxy contexts. It supplies concrete empirical numbers from a large model sample and introduces a falsifiable construct (response orthogonality) that directly predicts reliability, offering a clear path for improved instrumentation.

major comments (2)

[variance decomposition / findings 1] The central variance decomposition (abstract and § on findings) attributes 81-90% of between-model variation to directional bias under the assumption that bias is additive and orthogonal to trait effects. For LLMs this isolation is not obviously guaranteed, because next-token generation conditions on full item semantics and scale labels; any content-by-bias interaction would be misattributed to pure directional bias. Explicit residual diagnostics or simulation checks confirming orthogonality across the administered items are required to support the reported percentages.
[response orthogonality definition and reliability prediction] Response orthogonality is defined and used to predict reliability (finding 3), yet the manuscript provides no direct test that the proportion of opposing items is independent of the specific trait scales or LLM prompting regime. If orthogonality itself covaries with item content or model capability, the predictive claim would be circular and the 81-90% bias attribution would be overstated.

minor comments (2)

[methods] The abstract states four findings, but the procedures for isolating bias from trait effects and for computing orthogonality are described only at a high level; a dedicated methods subsection with item lists, exact variance formulas, and human-LLM comparison tables would improve reproducibility.
[finding 4] The claim that profiles 'can be manufactured through item selection' (finding 4) is important but would benefit from a supplementary table showing the exact item subsets and resulting profile shifts for at least two instruments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these constructive comments on the assumptions underlying our variance decomposition and the independence of the response orthogonality measure. We respond to each point below.


Referee: [variance decomposition / findings 1] The central variance decomposition (abstract and § on findings) attributes 81-90% of between-model variation to directional bias under the assumption that bias is additive and orthogonal to trait effects. For LLMs this isolation is not obviously guaranteed, because next-token generation conditions on full item semantics and scale labels; any content-by-bias interaction would be misattributed to pure directional bias. Explicit residual diagnostics or simulation checks confirming orthogonality across the administered items are required to support the reported percentages.

Authors: We agree that the reported variance decomposition assumes additivity and orthogonality of directional bias to trait effects, and that next-token prediction could in principle introduce content-by-bias interactions. Our decomposition was obtained by fitting a linear model that separates an item-specific trait direction term from a model-specific directional bias term; the large between-model variance component attributed to bias is therefore conditional on that specification. To strengthen the claim, the revised manuscript will include (i) residual plots and formal tests for remaining interactions from the fitted models and (ii) Monte Carlo simulations that inject controlled content-by-bias interactions and test whether the decomposition still recovers the original attribution percentages. These additions will be placed in the methods and results sections alongside the existing decomposition. revision: yes
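A simulation check of the kind described in this response could look roughly like the sketch below. It is an assumption-laden illustration, not the authors' planned analysis: responses are generated from an additive model plus an optional interaction in which a model's bias is amplified by a hypothetical item-level content feature, and the diagnostic asks whether the residuals of the additive fit still carry that interaction. The names `simulate`, `interaction_diagnostic`, and `gamma` are invented for this example, and the simulated population is larger than 56 models to keep the null case stable.

```python
import numpy as np

def simulate(gamma, n_models=200, n_items=40, seed=0):
    """Synthetic responses; gamma scales a content-by-bias interaction (hypothetical)."""
    rng = np.random.default_rng(seed)
    keys = np.tile([1.0, -1.0], n_items // 2)
    content = rng.normal(0, 1.0, n_items)        # stand-in item feature, e.g. polarity
    trait = rng.normal(0, 0.3, n_models)
    bias = rng.normal(0, 1.0, n_models)
    noise = rng.normal(0, 0.5, (n_models, n_items))
    responses = (trait[:, None] * keys[None, :]
                 + bias[:, None] * (1.0 + gamma * content[None, :])
                 + noise)
    return responses, keys, content

def interaction_diagnostic(responses, keys, content):
    """Correlation between each model's estimated bias and the slope of its
    additive-model residuals on the item feature. Near zero if bias is additive."""
    keys = np.asarray(keys, dtype=float)
    X = np.column_stack([np.ones_like(keys), keys])
    coef, *_ = np.linalg.lstsq(X, responses.T, rcond=None)    # (2, n_models)
    resid = responses - (X @ coef).T                          # (n_models, n_items)
    # Remove the part of the feature already explained by intercept and keying.
    c = np.asarray(content, dtype=float)
    c_coef, *_ = np.linalg.lstsq(X, c, rcond=None)
    c_perp = c - X @ c_coef
    slopes = resid @ c_perp / (c_perp @ c_perp)
    return float(np.corrcoef(coef[0], slopes)[0, 1])

corr_additive = interaction_diagnostic(*simulate(gamma=0.0))
corr_interacting = interaction_diagnostic(*simulate(gamma=0.5))
print(f"no interaction injected: r = {corr_additive:+.2f}")
print(f"interaction injected:    r = {corr_interacting:+.2f}")
```

A correlation near zero is what the additive specification predicts; a strong correlation signals that bias varies with item content, in which case the bias share from an additive decomposition would need to be interpreted with more care.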
Referee: [response orthogonality definition and reliability prediction] Response orthogonality is defined and used to predict reliability (finding 3), yet the manuscript provides no direct test that the proportion of opposing items is independent of the specific trait scales or LLM prompting regime. If orthogonality itself covaries with item content or model capability, the predictive claim would be circular and the 81-90% bias attribution would be overstated.

Authors: Response orthogonality was computed on the same set of instruments administered to all 56 models, which already span multiple trait domains and a wide capability range; the reported reliability prediction held uniformly across those instruments. Nevertheless, we accept that an explicit check for covariance between orthogonality and either model capability or item features would further rule out circularity. The revision will therefore report (a) the correlation between orthogonality scores and model capability and (b) the correlation between orthogonality and item-level semantic features (e.g., length, polarity). If any modest covariance is detected, we will note its magnitude and re-estimate the reliability prediction after residualizing orthogonality on those covariates. revision: partial
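The residualization step mentioned in this response can be sketched as follows. Everything here is synthetic and hypothetical: the units, the two covariates (standing in for quantities such as model capability or mean item length), and the generating relationships are invented solely to show the mechanics of regressing orthogonality on covariates and re-estimating its association with reliability.

```python
import numpy as np

def residualize(target, covariates):
    """Residuals of target after an ordinary least-squares fit on the covariates."""
    target = np.asarray(target, dtype=float)
    Z = np.column_stack([np.ones(len(target)), np.asarray(covariates, dtype=float)])
    coef, *_ = np.linalg.lstsq(Z, target, rcond=None)
    return target - Z @ coef

# Synthetic data (assumption): 60 units, each with an orthogonality score,
# a reliability estimate, and two covariates.
rng = np.random.default_rng(0)
n_units = 60
covariates = rng.normal(0, 1.0, (n_units, 2))
orthogonality = 0.5 + 0.1 * covariates[:, 0] + rng.normal(0, 0.1, n_units)
reliability = 0.2 + 1.0 * orthogonality + rng.normal(0, 0.05, n_units)

raw_r = float(np.corrcoef(orthogonality, reliability)[0, 1])

# Strip the covariate-explained part of orthogonality, then re-estimate.
orth_resid = residualize(orthogonality, covariates)
adjusted_r = float(np.corrcoef(orth_resid, reliability)[0, 1])

print(f"raw correlation:      {raw_r:+.2f}")
print(f"after residualizing:  {adjusted_r:+.2f}")
```

If the adjusted association stays close to the raw one, the covariates are not what drives the reliability prediction; a large drop would point toward the circularity the referee raises.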

Circularity Check

0 steps flagged

No significant circularity: empirical variance decomposition on administered instruments


The paper's claims rest on direct administration of personality/risk instruments to 56 LLMs and human samples, followed by variance decomposition that partitions observed response variation into trait-targeted vs. directional bias components. The 81-90% bias attribution is an output of this decomposition applied to collected data, not a quantity defined in terms of itself or recovered by fitting parameters to the target result. 'Response orthogonality' is coined as a descriptive proportion of opposing items and then shown to empirically predict reliability; this is a post-hoc correlation on the data, not a self-definitional loop. No self-citation chains, uniqueness theorems, or ansatzes imported from prior author work appear in the derivation. The analysis is self-contained against external benchmarks of instrument administration and standard psychometric variance partitioning.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper contributes empirical measurements across 56 models and a new analytic concept (response orthogonality) rather than relying on many free parameters or unverified entities; the central analysis rests on the applicability of human psychometric tools to LLMs.

axioms (1)

domain assumption: Psychometric instruments designed for humans can be meaningfully administered to LLMs to decompose responses into trait and bias components.
The entire analysis and variance decomposition rest on this separation being valid for non-human responders.

invented entities (1)

response orthogonality (no independent evidence)
purpose: Quantifies the proportion of items for which trait and bias point in opposite directions to predict apparent reliability
Newly coined term introduced to explain why some instruments appear more reliable than others for LLMs.

pith-pipeline@v0.9.1-grok · 5809 in / 1465 out tokens · 35382 ms · 2026-06-26T17:31:30.802845+00:00 · methodology


Reference graph

Works this paper leans on

60 extracted references · 42 canonical work pages · 6 internal anchors

[1]

Accessed May 2026

OpenAI.: Scaling AI for Everyone. Accessed May 2026. https://openai.com/index/scaling-ai-for-e veryone/

2026
[2]

Training language models to follow instructions with human feedback

Ouyang L, Wu J, Jiang X, Almeida D, Wainwright C, Mishkin P, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems. 2022;35:27730–27744

2022
[3]

Accessed May

Askell A, Carlsmith J, Olah C, Kaplan J, Karnofsky H, et al.: Claude’s Constitution. Accessed May
[4]

https://www.anthropic.com/constitution
[5]

Accessed May 2026

OpenAI.: Model Spec. Accessed May 2026. https://model-spec.openai.com/2025-02-12.html

2026
[6]

Anthropomorphism and Trust in Human-Large Language Model interactions

Kadambi A, D’Elia Y, Shah T, Comsa I, Lentz A, Siri-Ngammuang K, et al. Anthropomorphism and Trust in Human-Large Language Model interactions. arXiv preprint arXiv:260415316. 2026;https: //doi.org/10.48550/arXiv.2604.15316

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.15316 2026
[7]

Be Friendly, Not Friends: How LLM Sycophancy Shapes User Trust

Sun Y, Wang T. Be Friendly, Not Friends: How LLM Sycophancy Shapes User Trust. Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems. 2026;https://doi.org/10.114 5/3772318.3791079

arXiv 2026
[8]

How Personality Traits Shape

Hartley J, Hamill CB, Seddon D, Batra D, Okhrati R, Khraishi R. How Personality Traits Shape LLM Risk-Taking Behaviour. Findings of the Association for Computational Linguistics: ACL 2025. 2025;p. 21068–21092. https://doi.org/10.18653/v1/2025.findings-acl.1085

work page doi:10.18653/v1/2025.findings-acl.1085 2025
[9]

Psychometric Personality Shaping Mod- ulates Capabilities and Safety in Language Models

Fitz S, Romero P, Basart S, Chen S, Hernandez-Orallo J. Psychometric Personality Shaping Mod- ulates Capabilities and Safety in Language Models. arXiv preprint arXiv:250916332. 2025;https: //doi.org/10.48550/arXiv.2509.16332

work page doi:10.48550/arxiv.2509.16332 2025
[10]

Out of one, many: Using language models to simulate human samples

Argyle LP, Busby EC, Fulda N, Gubler JR, Rytting C, Wingate D. Out of one, many: Using language models to simulate human samples. Political Analysis. 2023;31(3):337–351. https://doi.org/10.101 7/pan.2023.2

2023
[11]

Large Language Models as Simulated Economic Agents: What Can We Learn from Homo Silicus?

Horton JJ, Filippas A, Manning BS. Large language models as simulated economic agents: What can we learn from homo silicus? [Working Paper]. National Bureau of Economic Research. 2023;https: //doi.org/10.3386/w31122. 14

work page doi:10.3386/w31122 2023
[12]

AI psychometrics: Assessing the psychological profiles of large language models through psychometric inventories

Pellert M, Lechner CM, Wagner C, Rammstedt B, Strohmaier M. AI psychometrics: Assessing the psychological profiles of large language models through psychometric inventories. Perspectives on Psychological Science. 2024;19(5):808–826. https://doi.org/10.1177/17456916231214460

work page doi:10.1177/17456916231214460 2024
[13]

A psychometric framework for evaluating and shaping personality traits in large language models

Serapio-Garc´ ıa G, Safdari M, Crepy C, Sun L, Fitz S, Romero P, et al. A psychometric framework for evaluating and shaping personality traits in large language models. Nature Machine Intelligence. 2025;7(12):1954–1968. https://doi.org/10.1038/s42256-025-01115-6

work page doi:10.1038/s42256-025-01115-6 2025
[14]

Decision-making behavior evaluation framework for LLMs under uncertain context

Jia J, Yuan Z, Pan J, McNamara P, Chen D. Decision-making behavior evaluation framework for LLMs under uncertain context. Advances in Neural Information Processing Systems. 2024;37:113360– 113382. https://doi.org/10.52202/079017-3601

work page doi:10.52202/079017-3601 2024
[15]

Moral Foundations of Large Language Models

Abdulhai M, Serapio-Garc´ ıa G, Crepy C, Valter D, Canny J, Jaques N. Moral Foundations of Large Language Models. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024;p. 17737–17752. https://doi.org/10.18653/v1/2024.emnlp-main.982

work page doi:10.18653/v1/2024.emnlp-main.982 2024
[16]

The political preferences of LLMs

Rozado D. The political preferences of LLMs. PloS one. 2024;19(7):e0306621. https://doi.org/10.1 371/journal.pone.0306621

2024
[17]

Large language model psychometrics: A systematic review of evaluation, validation, and enhancement

Ye H, Jin J, Xie Y, Zhang X, Song G. Large language model psychometrics: A systematic review of evaluation, validation, and enhancement. arXiv preprint arXiv:250508245. 2025;https://doi.org/10 .48550/arXiv.2505.08245

arXiv 2025
[18]

AIPsychoBench: Understanding the Psy- chometric Differences between LLMs and Humans

Xie W, Wang Z, Ma S, Sun X, Chen K, Wang E, et al. AIPsychoBench: Understanding the Psy- chometric Differences between LLMs and Humans. Topics in Cognitive Science. 2026;18(2):e70041. https://doi.org/10.1111/tops.70041

work page doi:10.1111/tops.70041 2026
[19]

Self-Assessment Tests are Unreliable Measures of LLM Person- ality

Gupta A, Song X, Anumanchipalli G. Self-Assessment Tests are Unreliable Measures of LLM Person- ality. Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP. 2024;p. 301–314. https://doi.org/10.18653/v1/2024.blackboxnlp-1.20

work page doi:10.18653/v1/2024.blackboxnlp-1.20 2024
[20]

You don ' t need a personality test to know these models are unreliable: Assessing the Reliability of Large Language Models on Psychometric Instruments

Shu B, Zhang L, Choi M, Dunagan L, Logeswaran L, Lee M, et al. You don’t need a personality test to know these models are unreliable: Assessing the Reliability of Large Language Models on Psychometric Instruments. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies ...

work page doi:10.18653/v1/2024.naacl-long.295 2024
[21]

In-Context Retrieval-Augmented Language Models , journal =

Tjuatja L, Chen V, Wu T, Talwalkar A, Neubig G. Do LLMs exhibit human-like response biases? A case study in survey design. Transactions of the Association for Computational Linguistics. 2024;12:1011–1026. https://doi.org/10.1162/tacl a 00685

work page internal anchor Pith review doi:10.1162/tacl 2024
[22]

Cognitive phantoms in large language models through the lens of latent variables

Peereboom S, Schwabe I, Kleinberg B. Cognitive phantoms in large language models through the lens of latent variables. Computers in Human Behavior: Artificial Humans. 2025;4:100161. https: //doi.org/10.1016/j.chbah.2025.100161

work page doi:10.1016/j.chbah.2025.100161 2025
[23]

, title =

S¨ uhr T, Dorner FE, Samadi S, Kelava A. Challenging the Validity of Personality Tests for Large Language Models. Proceedings of the 2025 Equity and Access in Algorithms, Mechanisms, and Optimization. 2025;p. 74–81. https://doi.org/10.1145/3757887.3763016

work page doi:10.1145/3757887.3763016 2025
[24]

Do Psychometric Tests Work for Large Language Models? Evaluation of Tests on Sexism, Racism, and Morality

Jung J, Lutz M, Sen I, Strohmaier M. Do Psychometric Tests Work for Large Language Models? Evaluation of Tests on Sexism, Racism, and Morality. Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). 2026;p. 8143–8173. https://doi.org/10.18653/v1/2026.eacl-long.380

work page doi:10.18653/v1/2026.eacl-long.380 2026
[25]

Position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead

S¨ uhr T, Dorner FE, Salaudeen O, Kelava A, Samadi S. Position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead. arXiv preprint arXiv:250723009. 2026;https: //doi.org/10.48550/arXiv.2507.23009

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2507.23009 2026
[26]

Yeasayers and naysayers: Agreeing response set as a personality variable

Couch A, Keniston K. Yeasayers and naysayers: Agreeing response set as a personality variable. The Journal of Abnormal and Social Psychology. 1960;60(2):151–174. https://doi.org/10.1037/h0040372. 15

work page doi:10.1037/h0040372 1960
[27]

Further evidence on response sets and test design

Cronbach LJ. Further evidence on response sets and test design. Educational and Psychological Measurement. 1950;10(1):3–31. https://doi.org/10.1177/001316445001000101

work page doi:10.1177/001316445001000101 1950
[28]

Coefficient alpha and the internal structure of tests

Cronbach LJ. Coefficient alpha and the internal structure of tests. Psychometrika. 1951;16(3):297–

1951
[29]

https://doi.org/10.1007/BF02310555

work page doi:10.1007/bf02310555
[30]

Response Styles in Marketing Research: A Cross-National Inves- tigation

Baumgartner H, Steenkamp JBEM. Response Styles in Marketing Research: A Cross-National Inves- tigation. Journal of Marketing Research. 2001;38(2):143–156. https://doi.org/10.1509/jmkr.38.2.14 3.18840

work page doi:10.1509/jmkr.38.2.14 2001
[31]

A broad-bandwidth, public domain, personality inventory measuring the lower-level facets of several five-factor models

Goldberg LR. A broad-bandwidth, public domain, personality inventory measuring the lower-level facets of several five-factor models. Personality psychology in Europe. 1999;7(1):7–28

1999
[32]

Measuring thirty facets of the Five Factor Model with a 120-item public domain inventory: Development of the IPIP-NEO-120

Johnson JA. Measuring thirty facets of the Five Factor Model with a 120-item public domain inventory: Development of the IPIP-NEO-120. Journal of Research in Personality. 2014;51:78–89. https://doi.org/10.1016/j.jrp.2014.05.003

work page doi:10.1016/j.jrp.2014.05.003 2014
[33]

Risk preference shares the psychometric structure of major psychological traits

Frey R, Pedroni A, Mata R, Rieskamp J, Hertwig R. Risk preference shares the psychometric structure of major psychological traits. Science Advances. 2017;3(10):e1701381. https://doi.org/10 .1126/sciadv.1701381

2017
[34]

https://osf.io/tbmh5/

Johnson JA.: IPIP-NEO Data Repository. https://osf.io/tbmh5/. Open Science Framework
[35]

Efficient Guided Generation for Large Language Models

Willard BT, Louf R. Efficient Guided Generation for Large Language Models. arXiv preprint arXiv:230709702. 2023;https://doi.org/10.48550/arXiv.2307.09702

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2307.09702 2023
[36]

A meta-analytic review of two modes of learning and the description-experience gap

Wulff DU, Mergenthaler-Canseco M, Hertwig R. A meta-analytic review of two modes of learning and the description-experience gap. Psychological bulletin. 2018;144(2):140. https://doi.org/10.103 7/bul0000115

2018
[37]

Positive and negative global self-esteem: A substantively meaningful distinction or artifactors? Journal of personality and social psychology

Marsh HW. Positive and negative global self-esteem: A substantively meaningful distinction or artifactors? Journal of personality and social psychology. 1996;70(4):810. https://doi.org/10.1037/ 0022-3514.70.4.810

1996
[38]

Misresponse to reversed and negated items in surveys: A review

Weijters B, Baumgartner H. Misresponse to reversed and negated items in surveys: A review. Journal of Marketing Research. 2012;49(5):737–747. https://doi.org/10.1509/jmr.11.0368

work page doi:10.1509/jmr.11.0368 2012
[39]

Ineffectiveness of Reverse Wording of Questionnaire Items: Let’s Learn from Cows in the Rain

Sonderen EV, Sanderman R, Coyne JC. Ineffectiveness of Reverse Wording of Questionnaire Items: Let’s Learn from Cows in the Rain. PLoS ONE. 2013;8(7):e68967. https://doi.org/10.1371/journa l.pone.0068967

work page doi:10.1371/journa 2013
[40]

This is not a dataset: A large negation benchmark to challenge large language models

Garc´ ıa-Ferrero I, Altuna B, Alvez J, Gonzalez-Dios I, Rigau G. This is not a dataset: A large negation benchmark to challenge large language models. Proceedings of the 2023 conference on empirical methods in natural language processing. 2023;p. 8596–8615. https://doi.org/10.18653/v1/2023.emn lp-main.531

work page doi:10.18653/v1/2023.emn 2023
[41]

Factors defined by negatively keyed items: The result of careless respondents? Applied Psychological Measurement

Schmitt N, Stults DM. Factors defined by negatively keyed items: The result of careless respondents? Applied Psychological Measurement. 1985;9(4):367–373. https://doi.org/10.1177/01466216850090 0405

work page doi:10.1177/01466216850090 1985
[42]

Training data limits the prediction of consumer heterogeneity by LLM-based digital twins

Krefeld-Schwalb A, Johnson E, Weaver C, Wulff DU. Training data limits the prediction of consumer heterogeneity by LLM-based digital twins. OSF Preprints. 2026;https://doi.org/10.31234/osf.io/97 dya v1

work page doi:10.31234/osf.io/97 2026
[43]

Development and validation of large language model rating scales for automatically transcribed psychological therapy sessions

Eberhardt ST, Vehlen A, Schaffrath J, Schwartz B, Baur T, Schiller D, et al. Development and validation of large language model rating scales for automatically transcribed psychological therapy sessions. Scientific Reports. 2025;15(1):29541. https://doi.org/10.1038/s41598-025-14923-y

work page doi:10.1038/s41598-025-14923-y 2025
[44]

Evaluating and Inducing Personality in Pre-trained Language Models

Jiang G, Xu M, Zhu SC, Han W, Zhang C, Zhu Y. Evaluating and Inducing Personality in Pre-trained Language Models. Advances in Neural Information Processing Systems. 2023;36:10622–10643. 16

2023
[45]

The scientific value of numerical measures of human feelings

Kaiser C, Oswald AJ. The scientific value of numerical measures of human feelings. Proceedings of the National Academy of Sciences. 2022;119(42):e2210412119. https://doi.org/10.1073/pnas.22104 12119

work page doi:10.1073/pnas.22104 2022
[46]

Findings of the Association for Computational Linguistics: ACL 2023 , pages=

Perez E, Ringer S, Lukosiute K, Nguyen K, Chen E, Heiner S, et al. Discovering Language Model Behaviors with Model-Written Evaluations. Findings of the Association for Computational Linguistics: ACL 2023. 2023;p. 13387–13434. https://doi.org/10.18653/v1/2023.findings-acl.847

work page doi:10.18653/v1/2023.findings-acl.847 2023
[47]

Towards Under- standing Sycophancy in Language Models

Sharma M, Tong M, Korbak T, Duvenaud D, Askell A, Bowman SR, et al. Towards Under- standing Sycophancy in Language Models. The Twelfth International Conference on Learning Representations. 2024;https://openreview.net/forum?id=tvhaxkMKAn

2024
[48]

Li A, Bagger J. Using the BIDR to distinguish the effects of impression management and self-deception on the criterion validity of personality measures: A meta-analysis. International Journal of Selection and Assessment. 2006;14(2):131–141. https://doi.org/10.1111/j.1468-2389.2006.00339.x

[49]

Hussain Z, Mata R, Wulff DU. How human-like is LLM cognition? OSF Preprints. 2026;https://doi.org/10.31234/osf.io/2yvnt_v2

[50]

Xu Q, Peng Y, Nastase SA, Chodorow M, Wu M, Li P. Large language models without grounding recover non-sensorimotor but not sensorimotor features of human concepts. Nature human behaviour. 2025;9(9):1871–1886. https://doi.org/10.1038/s41562-025-02203-8

[51]

Kumar S, Flint A, Aiello LM, Baronchelli A. Failure of contextual invariance in gender inference with large language models. arXiv preprint arXiv:2603.23485. 2026;https://doi.org/10.48550/arXiv.2603.23485

[52]

Hussain Z, Mata R, Wulff DU. A rebuttal of two common deflationary stances against LLM cognition. Findings of the Association for Computational Linguistics. 2025;p. 24208–24213. https://doi.org/10.18653/v1/2025.findings-acl.1242

[53]

Kriegmair V, Wulff DU. Machine individuality: Separating genuine idiosyncrasy from response bias in large language models. arXiv preprint arXiv:2604.16755. 2026;https://doi.org/10.48550/arXiv.2604.16755

[54]

Faulborn M, Sen I, Pellert M, Spitz A, Garcia D. Only a Little to the Left: A Theory-grounded Measure of Political Bias in Large Language Models. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics. 2025;p. 31684–31704. https://doi.org/10.18653/v1/2025.acl-long.1529

[55]

Blais AR, Weber EU. A Domain-Specific Risk-Taking (DOSPERT) scale for adult populations. Judgment and Decision Making. 2006;1(1):33–47. https://doi.org/10.1017/S1930297500000334

[56]

Lejuez CW, Read JP, Kahler CW, Richards JB, Ramsey SE, Stuart GL, et al. Evaluation of a behavioral measure of risk taking: The Balloon Analogue Risk Task (BART). Journal of Experimental Psychology: Applied. 2002;8(2):75–84. https://doi.org/10.1037/1076-898X.8.2.75

[57]

Löhn L, Kiehne N, Ljapunov A, Balke WT. Is Machine Psychology here? On Requirements for Using Human Psychological Tests on Large Language Models. Proceedings of the 17th International Natural Language Generation Conference. 2024;p. 230–242. https://doi.org/10.18653/v1/2024.inlg-main.19

[58]

Wulff DU, Hussain Z, Mata R. The behavioral and social sciences need open LLMs. OSF Preprints. 2024;https://doi.org/10.31219/osf.io/ybvzs

[59]

Binz M, Akata E, Almaatouq A, Alsobay M, Ariasov O, Brändle F, et al. Post-training makes large language models less human-like. arXiv preprint arXiv:2605.07632. 2026;https://doi.org/10.48550/arXiv.2605.07632

[60]

Kuribayashi T, Oseki Y, Baldwin T. Psychometric Predictive Power of Large Language Models. Findings of the Association for Computational Linguistics: NAACL 2024. 2024;p. 1983–2005. https://doi.org/10.18653/v1/2024.findings-naacl.129

[1]

OpenAI.: Scaling AI for Everyone. Accessed May 2026. https://openai.com/index/scaling-ai-for-everyone/

[2]

Ouyang L, Wu J, Jiang X, Almeida D, Wainwright C, Mishkin P, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems. 2022;35:27730–27744.

[3]

Askell A, Carlsmith J, Olah C, Kaplan J, Karnofsky H, et al.: Claude’s Constitution. Accessed May. https://www.anthropic.com/constitution

[5]

OpenAI.: Model Spec. Accessed May 2026. https://model-spec.openai.com/2025-02-12.html

[6]

Kadambi A, D’Elia Y, Shah T, Comsa I, Lentz A, Siri-Ngammuang K, et al. Anthropomorphism and Trust in Human-Large Language Model interactions. arXiv preprint arXiv:2604.15316. 2026;https://doi.org/10.48550/arXiv.2604.15316

[7]

Sun Y, Wang T. Be Friendly, Not Friends: How LLM Sycophancy Shapes User Trust. Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems. 2026;https://doi.org/10.1145/3772318.3791079

[8]

Hartley J, Hamill CB, Seddon D, Batra D, Okhrati R, Khraishi R. How Personality Traits Shape LLM Risk-Taking Behaviour. Findings of the Association for Computational Linguistics: ACL 2025. 2025;p. 21068–21092. https://doi.org/10.18653/v1/2025.findings-acl.1085

[9]

Fitz S, Romero P, Basart S, Chen S, Hernandez-Orallo J. Psychometric Personality Shaping Modulates Capabilities and Safety in Language Models. arXiv preprint arXiv:2509.16332. 2025;https://doi.org/10.48550/arXiv.2509.16332

[10]

Argyle LP, Busby EC, Fulda N, Gubler JR, Rytting C, Wingate D. Out of one, many: Using language models to simulate human samples. Political Analysis. 2023;31(3):337–351. https://doi.org/10.1017/pan.2023.2

[11]

Horton JJ, Filippas A, Manning BS. Large language models as simulated economic agents: What can we learn from homo silicus? [Working Paper]. National Bureau of Economic Research. 2023;https://doi.org/10.3386/w31122

[12]

Pellert M, Lechner CM, Wagner C, Rammstedt B, Strohmaier M. AI psychometrics: Assessing the psychological profiles of large language models through psychometric inventories. Perspectives on Psychological Science. 2024;19(5):808–826. https://doi.org/10.1177/17456916231214460

[13]

Serapio-García G, Safdari M, Crepy C, Sun L, Fitz S, Romero P, et al. A psychometric framework for evaluating and shaping personality traits in large language models. Nature Machine Intelligence. 2025;7(12):1954–1968. https://doi.org/10.1038/s42256-025-01115-6

[14]

Jia J, Yuan Z, Pan J, McNamara P, Chen D. Decision-making behavior evaluation framework for LLMs under uncertain context. Advances in Neural Information Processing Systems. 2024;37:113360–113382. https://doi.org/10.52202/079017-3601

[15]

Abdulhai M, Serapio-García G, Crepy C, Valter D, Canny J, Jaques N. Moral Foundations of Large Language Models. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024;p. 17737–17752. https://doi.org/10.18653/v1/2024.emnlp-main.982

[16]

Rozado D. The political preferences of LLMs. PloS one. 2024;19(7):e0306621. https://doi.org/10.1371/journal.pone.0306621

[17]

Ye H, Jin J, Xie Y, Zhang X, Song G. Large language model psychometrics: A systematic review of evaluation, validation, and enhancement. arXiv preprint arXiv:2505.08245. 2025;https://doi.org/10.48550/arXiv.2505.08245

[18]

Xie W, Wang Z, Ma S, Sun X, Chen K, Wang E, et al. AIPsychoBench: Understanding the Psychometric Differences between LLMs and Humans. Topics in Cognitive Science. 2026;18(2):e70041. https://doi.org/10.1111/tops.70041

[19]

Gupta A, Song X, Anumanchipalli G. Self-Assessment Tests are Unreliable Measures of LLM Personality. Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP. 2024;p. 301–314. https://doi.org/10.18653/v1/2024.blackboxnlp-1.20

[20]

Shu B, Zhang L, Choi M, Dunagan L, Logeswaran L, Lee M, et al. You don’t need a personality test to know these models are unreliable: Assessing the Reliability of Large Language Models on Psychometric Instruments. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies ... https://doi.org/10.18653/v1/2024.naacl-long.295

[21]

Tjuatja L, Chen V, Wu T, Talwalkar A, Neubig G. Do LLMs exhibit human-like response biases? A case study in survey design. Transactions of the Association for Computational Linguistics. 2024;12:1011–1026. https://doi.org/10.1162/tacl_a_00685

[22]

Peereboom S, Schwabe I, Kleinberg B. Cognitive phantoms in large language models through the lens of latent variables. Computers in Human Behavior: Artificial Humans. 2025;4:100161. https://doi.org/10.1016/j.chbah.2025.100161

[23]

Sühr T, Dorner FE, Samadi S, Kelava A. Challenging the Validity of Personality Tests for Large Language Models. Proceedings of the 2025 Equity and Access in Algorithms, Mechanisms, and Optimization. 2025;p. 74–81. https://doi.org/10.1145/3757887.3763016

[24]

Jung J, Lutz M, Sen I, Strohmaier M. Do Psychometric Tests Work for Large Language Models? Evaluation of Tests on Sexism, Racism, and Morality. Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). 2026;p. 8143–8173. https://doi.org/10.18653/v1/2026.eacl-long.380

[25]

Sühr T, Dorner FE, Salaudeen O, Kelava A, Samadi S. Position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead. arXiv preprint arXiv:2507.23009. 2026;https://doi.org/10.48550/arXiv.2507.23009

[26]

Couch A, Keniston K. Yeasayers and naysayers: Agreeing response set as a personality variable. The Journal of Abnormal and Social Psychology. 1960;60(2):151–174. https://doi.org/10.1037/h0040372

[27]

Cronbach LJ. Further evidence on response sets and test design. Educational and Psychological Measurement. 1950;10(1):3–31. https://doi.org/10.1177/001316445001000101

[28]

Cronbach LJ. Coefficient alpha and the internal structure of tests. Psychometrika. 1951;16(3):297– https://doi.org/10.1007/BF02310555

[30]

Baumgartner H, Steenkamp JBEM. Response Styles in Marketing Research: A Cross-National Investigation. Journal of Marketing Research. 2001;38(2):143–156. https://doi.org/10.1509/jmkr.38.2.143.18840

[31]

Goldberg LR. A broad-bandwidth, public domain, personality inventory measuring the lower-level facets of several five-factor models. Personality psychology in Europe. 1999;7(1):7–28.

[32]

Johnson JA. Measuring thirty facets of the Five Factor Model with a 120-item public domain inventory: Development of the IPIP-NEO-120. Journal of Research in Personality. 2014;51:78–89. https://doi.org/10.1016/j.jrp.2014.05.003

[33]

Frey R, Pedroni A, Mata R, Rieskamp J, Hertwig R. Risk preference shares the psychometric structure of major psychological traits. Science Advances. 2017;3(10):e1701381. https://doi.org/10.1126/sciadv.1701381

[34]

Johnson JA.: IPIP-NEO Data Repository. https://osf.io/tbmh5/. Open Science Framework.

[35]

Willard BT, Louf R. Efficient Guided Generation for Large Language Models. arXiv preprint arXiv:2307.09702. 2023;https://doi.org/10.48550/arXiv.2307.09702

[36]

Wulff DU, Mergenthaler-Canseco M, Hertwig R. A meta-analytic review of two modes of learning and the description-experience gap. Psychological bulletin. 2018;144(2):140. https://doi.org/10.1037/bul0000115

[37]

Marsh HW. Positive and negative global self-esteem: A substantively meaningful distinction or artifactors? Journal of personality and social psychology. 1996;70(4):810. https://doi.org/10.1037/0022-3514.70.4.810

[38]

Weijters B, Baumgartner H. Misresponse to reversed and negated items in surveys: A review. Journal of Marketing Research. 2012;49(5):737–747. https://doi.org/10.1509/jmr.11.0368

[39]

Sonderen EV, Sanderman R, Coyne JC. Ineffectiveness of Reverse Wording of Questionnaire Items: Let’s Learn from Cows in the Rain. PLoS ONE. 2013;8(7):e68967. https://doi.org/10.1371/journal.pone.0068967

[40]

García-Ferrero I, Altuna B, Alvez J, Gonzalez-Dios I, Rigau G. This is not a dataset: A large negation benchmark to challenge large language models. Proceedings of the 2023 conference on empirical methods in natural language processing. 2023;p. 8596–8615. https://doi.org/10.18653/v1/2023.emnlp-main.531

[41]

Schmitt N, Stults DM. Factors defined by negatively keyed items: The result of careless respondents? Applied Psychological Measurement. 1985;9(4):367–373. https://doi.org/10.1177/014662168500900405

[42]

Krefeld-Schwalb A, Johnson E, Weaver C, Wulff DU. Training data limits the prediction of consumer heterogeneity by LLM-based digital twins. OSF Preprints. 2026;https://doi.org/10.31234/osf.io/97dya_v1

[43]

Eberhardt ST, Vehlen A, Schaffrath J, Schwartz B, Baur T, Schiller D, et al. Development and validation of large language model rating scales for automatically transcribed psychological therapy sessions. Scientific Reports. 2025;15(1):29541. https://doi.org/10.1038/s41598-025-14923-y
