arxiv: 2606.20205 · v1 · pith:VQLGUZ4Dnew · submitted 2026-06-18 · 💻 cs.AI · cs.CL · cs.HC

Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact

Pith reviewed 2026-06-26 17:31 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.HC
keywords large language models · psychological profiles · measurement artifact · directional response bias · psychometric framework · response orthogonality · variance decomposition

The pith

Apparent psychological profiles of large language models arise mostly from directional response bias rather than the traits the tests target.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper administers personality and risk-preference instruments to 56 instruction-tuned LLMs and to human reference samples, then applies a formal psychometric variance decomposition. It finds that between-model differences are driven by a directional response bias toward one end of the scale or one labeled option, which accounts for 81-90 percent of between-model variation, against 9-16 percent in humans. This bias declines with model capability but does not disappear. Because bias dominates, an instrument's apparent reliability is predicted almost entirely by its response orthogonality (the share of items on which trait and bias point in opposite directions), and model profiles shift with the item set or can be manufactured by choosing it. The upshot is that instruments borrowed from human psychology yield artifacts rather than valid LLM traits.

Core claim

Using a battery of self-report and behavioral instruments, the authors show that differences between LLMs are driven not by targeted traits but by directional response bias; variance decomposition attributes 81-90% of between-model variation to this bias. The bias declines with capability but is not eliminated. Instrument reliability is almost entirely predicted by response orthogonality, and profiles shift with item selection or can be manufactured through it.

What carries the argument

Directional response bias (tendency to respond toward one end of the scale regardless of item content), isolated through variance decomposition within a formal psychometric framework applied to self-report and behavioral tasks.
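
To make the decomposition concrete, here is a minimal simulation sketch. It assumes an additive model (response = keying × trait + bias + noise) on a centered response scale and a balanced instrument. The function names, the estimator (half-sum and half-difference of the keyed item means), and every parameter value are illustrative assumptions, not the paper's actual procedure.

```python
import random
from statistics import mean, pvariance


def simulate(n_models, keys, trait_sd, bias_sd, noise_sd=0.5, seed=0):
    """Draw one row of centered item responses per simulated responder."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_models):
        trait = rng.gauss(0, trait_sd)  # content-driven component
        bias = rng.gauss(0, bias_sd)    # content-blind directional component
        data.append([k * trait + bias + rng.gauss(0, noise_sd) for k in keys])
    return data


def bias_share(data, keys):
    """Share of between-responder variance attributed to directional bias."""
    trait_hat, bias_hat = [], []
    for row in data:
        pos = mean(r for r, k in zip(row, keys) if k > 0)
        neg = mean(r for r, k in zip(row, keys) if k < 0)
        trait_hat.append((pos - neg) / 2)  # flips sign with keying
        bias_hat.append((pos + neg) / 2)   # ignores keying
    vt, vb = pvariance(trait_hat), pvariance(bias_hat)
    return vb / (vt + vb)


keys = [1, -1] * 10  # balanced: half positively, half negatively keyed
bias_dominated = bias_share(simulate(500, keys, trait_sd=1.0, bias_sd=3.0), keys)
trait_dominated = bias_share(simulate(500, keys, trait_sd=3.0, bias_sd=1.0), keys)
print(round(bias_dominated, 2), round(trait_dominated, 2))
```

With a bias standard deviation three times the trait's, the recovered share should land near 0.9; reversing the ratio should give roughly 0.1. These settings were chosen only to mimic the contrast the paper reports between LLMs and humans.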

If this is right

  • Apparent reliability of any instrument for LLMs is almost entirely predicted by its response orthogonality.
  • Model profiles shift with the items used and can be manufactured through item selection.
  • The bias declines with model capability but is not eliminated by it.
  • Instruments borrowed from human psychology rarely achieve full orthogonality and may lack validity for LLMs.
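
The first two consequences can be illustrated with a toy calculation. The sketch below assumes a 1-5 agreement scale, items keyed +1 or -1, and a bias that points toward the "agree" end. `response_orthogonality` and `scale_score` are hypothetical helper names, and the paper's exact operationalization may differ.

```python
def response_orthogonality(keys, bias_direction=1):
    """Share of items whose trait keying opposes the bias direction."""
    opposing = sum(1 for k in keys if k * bias_direction < 0)
    return opposing / len(keys)


def scale_score(responses, keys, low=1, high=5):
    """Mean trait score after reverse-scoring negatively keyed items."""
    scored = [r if k > 0 else low + high - r for r, k in zip(responses, keys)]
    return sum(scored) / len(scored)


# A purely bias-driven responder: answers 4 ("agree") to every item.
item_sets = {
    "all positively keyed": [1] * 10,
    "balanced": [1, -1] * 5,
    "all negatively keyed": [-1] * 10,
}
for name, keys in item_sets.items():
    responses = [4] * len(keys)
    print(name, response_orthogonality(keys), scale_score(responses, keys))
# all positively keyed 0.0 4.0
# balanced 0.5 3.0
# all negatively keyed 1.0 2.0
```

The same content-blind responder looks high, average, or low on the trait depending only on which items are chosen.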

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Dedicated LLM assessments should be built around response orthogonality rather than direct adoption of human scales.
  • Using LLMs as proxies for human participants in psychological research may introduce systematic artifacts traceable to response bias.
  • Similar directional biases could affect other survey-style evaluations of AI systems beyond personality and risk measures.

Load-bearing premise

Directional response bias can be cleanly separated from the targeted trait effects by the chosen instruments and the variance decomposition method.
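
The premise can be written out explicitly. The notation below is Pith's, not the paper's: $y_{mi}$ is the response of model $m$ to item $i$ on a centered scale, $k_i \in \{+1, -1\}$ is the item's keying, $\theta_m$ the trait, $b_m$ the directional bias, $\gamma_{mi}$ any content-specific departure from additivity, and bars with superscripts $+$ and $-$ denote means over positively and negatively keyed items. For an instrument containing both keyings, the half-sum and half-difference of the keyed means are natural estimators:

```latex
\begin{aligned}
y_{mi} &= k_i\,\theta_m + b_m + \gamma_{mi} + \varepsilon_{mi},\\
\hat b_m &= \tfrac12\bigl(\bar y_m^{+} + \bar y_m^{-}\bigr)
  = b_m + \tfrac12\bigl(\bar\gamma_m^{+} + \bar\gamma_m^{-}\bigr) + \text{noise},\\
\hat\theta_m &= \tfrac12\bigl(\bar y_m^{+} - \bar y_m^{-}\bigr)
  = \theta_m + \tfrac12\bigl(\bar\gamma_m^{+} - \bar\gamma_m^{-}\bigr) + \text{noise}.
\end{aligned}
```

Under this sketch the separation is clean only when $\gamma_{mi} = 0$. Any part of $\gamma$ that does not cancel across the two keyings is counted as bias.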

What would settle it

A replication in which the same variance decomposition on new LLMs and instruments attributes most between-model variation to the intended traits rather than directional bias, or in which model profiles remain stable when items are swapped for others measuring the same trait.
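
The second test, profile stability under item swaps, is straightforward to operationalize. The sketch below rests on assumptions of our own: centered responses, an additive trait-plus-bias responder, and Pearson correlation across models as the stability measure. It compares trait scores from a positively keyed subset with scores from a negatively keyed subset of the same trait; all names and parameter values are hypothetical.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def swap_stability(trait_sd, bias_sd, n_models=500, n_items=10,
                   noise_sd=0.3, seed=0):
    """Correlate trait scores from two item subsets with opposite keying."""
    rng = random.Random(seed)
    score_a, score_b = [], []
    for _ in range(n_models):
        trait = rng.gauss(0, trait_sd)
        bias = rng.gauss(0, bias_sd)
        # Subset A: positively keyed items, scored as answered.
        a = [trait + bias + rng.gauss(0, noise_sd) for _ in range(n_items)]
        # Subset B: negatively keyed items, reverse-scored (sign flipped).
        b = [-(-trait + bias + rng.gauss(0, noise_sd)) for _ in range(n_items)]
        score_a.append(mean(a))
        score_b.append(mean(b))
    return pearson(score_a, score_b)


print("trait-driven responders:", round(swap_stability(3.0, 0.3), 2))
print("bias-driven responders: ", round(swap_stability(0.3, 3.0), 2))
```

A strongly positive correlation means the swap leaves the model ranking intact. A strongly negative one means the two subsets assign opposite profiles to the same models, which is the pattern the paper's account predicts for bias-driven responders.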

Original abstract

Psychological instruments designed for humans are increasingly used to assign large language models (LLMs) stable psychological profiles that affect their usability, safety assessment, and use as proxies for human participants in research. Using a formal psychometric framework, we show that these profiles are largely a measurement artifact. Administering a battery of personality and risk-preference instruments spanning self-reports and behavioral tasks to 56 instruction-tuned LLMs alongside large human reference samples, we report four findings. First, differences between models are driven not by the traits an instrument targets but by a directional response bias, a tendency to respond toward one end of the scale, or one labeled option, regardless of item content; a variance decomposition attributes 81-90% of between-model variation to this bias, against 9-16% in humans. Second, the bias declines with model capability but is not eliminated by it. Third, because bias rather than trait drives responding, an instrument's apparent reliability is almost entirely predicted by its response orthogonality, a term we coin for the proportion of items for which trait and bias point in opposite directions. Fourth, the profile a model appears to have shifts with the items used and can be manufactured through item selection. These results demonstrate that the apparent psychological profiles of LLMs are artifacts of the instrument used to measure them, not properties of the models themselves. As instruments borrowed from human psychology are rarely fully orthogonal and may inherently lack validity for LLMs, we call for dedicated assessments centered on response orthogonality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper administers a battery of personality and risk-preference instruments (self-report and behavioral) to 56 instruction-tuned LLMs and human reference samples. It reports four findings: (1) between-model differences are driven primarily by directional response bias (tendency toward one scale end or labeled option, independent of item content), with a variance decomposition attributing 81-90% of between-model variation to this bias versus 9-16% in humans; (2) the bias declines with model capability but is not eliminated; (3) instrument reliability is almost entirely predicted by 'response orthogonality' (proportion of items where trait and bias oppose); (4) apparent profiles shift with item selection and can be manufactured. The conclusion is that LLM psychological profiles are measurement artifacts, not model properties, and calls for new assessments centered on response orthogonality.

Significance. If the variance decomposition and orthogonality results hold under scrutiny, the work would substantially undermine the use of borrowed human psychological instruments for profiling LLMs in usability, safety, and research-proxy contexts. It supplies concrete empirical numbers from a large model sample and introduces a falsifiable construct (response orthogonality) that directly predicts reliability, offering a clear path for improved instrumentation.

major comments (2)
  1. [variance decomposition / findings 1] The central variance decomposition (abstract and § on findings) attributes 81-90% of between-model variation to directional bias under the assumption that bias is additive and orthogonal to trait effects. For LLMs this isolation is not obviously guaranteed, because next-token generation conditions on full item semantics and scale labels; any content-by-bias interaction would be misattributed to pure directional bias. Explicit residual diagnostics or simulation checks confirming orthogonality across the administered items are required to support the reported percentages.
  2. [response orthogonality definition and reliability prediction] Response orthogonality is defined and used to predict reliability (finding 3), yet the manuscript provides no direct test that the proportion of opposing items is independent of the specific trait scales or LLM prompting regime. If orthogonality itself covaries with item content or model capability, the predictive claim would be circular and the 81-90% bias attribution would be overstated.
minor comments (2)
  1. [methods] The abstract states four findings but the methods description of how bias was isolated from trait effects and how orthogonality was computed is referenced only at high level; a dedicated methods subsection with item lists, exact variance formulas, and human-LLM comparison tables would improve reproducibility.
  2. [finding 4] The claim that profiles 'can be manufactured through item selection' (finding 4) is important but would benefit from a supplementary table showing the exact item subsets and resulting profile shifts for at least two instruments.
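
The simulation check requested in major comment 1 could look roughly like the following. This is a sketch under assumptions of our own: responses are centered, the instrument is balanced, bias and trait are estimated as the half-sum and half-difference of the keyed item means, and the injected interaction takes one hypothetical form (a responder reverses only a fraction `1 - lam` of its trait on negatively keyed items). None of these choices is taken from the manuscript.

```python
import random
from statistics import mean, pvariance


def estimated_bias_share(lam, n_models=2000, n_items=20,
                         trait_sd=1.0, bias_sd=1.0, noise_sd=0.3, seed=0):
    """Simulate responders and return the bias share an additive fit reports.

    lam = 0 reproduces the additive model; lam = 1 means the trait is not
    reversed at all on negatively keyed items (hypothetical interaction).
    """
    rng = random.Random(seed)
    keys = [1, -1] * (n_items // 2)
    trait_hat, bias_hat = [], []
    for _ in range(n_models):
        trait = rng.gauss(0, trait_sd)
        bias = rng.gauss(0, bias_sd)
        pos, neg = [], []
        for k in keys:
            noise = rng.gauss(0, noise_sd)
            if k > 0:
                pos.append(trait + bias + noise)
            else:
                neg.append(-(1 - lam) * trait + bias + noise)
        trait_hat.append((mean(pos) - mean(neg)) / 2)
        bias_hat.append((mean(pos) + mean(neg)) / 2)
    vt, vb = pvariance(trait_hat), pvariance(bias_hat)
    return vb / (vt + vb)


for lam in (0.0, 0.5, 1.0):
    print(lam, round(estimated_bias_share(lam), 2))
```

With equal true trait and bias variance, the reported bias share is expected to sit near 0.5 at `lam = 0` and to rise as `lam` grows. That is the kind of misattribution the referee describes: trait variance that fails to reverse is booked as directional bias.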

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these constructive comments on the assumptions underlying our variance decomposition and the independence of the response orthogonality measure. We respond to each point below.

Point-by-point responses
  1. Referee: [variance decomposition / findings 1] The central variance decomposition (abstract and § on findings) attributes 81-90% of between-model variation to directional bias under the assumption that bias is additive and orthogonal to trait effects. For LLMs this isolation is not obviously guaranteed, because next-token generation conditions on full item semantics and scale labels; any content-by-bias interaction would be misattributed to pure directional bias. Explicit residual diagnostics or simulation checks confirming orthogonality across the administered items are required to support the reported percentages.

    Authors: We agree that the reported variance decomposition assumes additivity and orthogonality of directional bias to trait effects, and that next-token prediction could in principle introduce content-by-bias interactions. Our decomposition was obtained by fitting a linear model that separates an item-specific trait direction term from a model-specific directional bias term; the large between-model variance component attributed to bias is therefore conditional on that specification. To strengthen the claim, the revised manuscript will include (i) residual plots and formal tests for remaining interactions from the fitted models and (ii) Monte Carlo simulations that inject controlled content-by-bias interactions and recover the original attribution percentages. These additions will be placed in the methods and results sections alongside the existing decomposition. revision: yes

  2. Referee: [response orthogonality definition and reliability prediction] Response orthogonality is defined and used to predict reliability (finding 3), yet the manuscript provides no direct test that the proportion of opposing items is independent of the specific trait scales or LLM prompting regime. If orthogonality itself covaries with item content or model capability, the predictive claim would be circular and the 81-90% bias attribution would be overstated.

    Authors: Response orthogonality was computed on the same set of instruments administered to all 56 models, which already span multiple trait domains and a wide capability range; the reported reliability prediction held uniformly across those instruments. Nevertheless, we accept that an explicit check for covariance between orthogonality and either model capability or item features would further rule out circularity. The revision will therefore report (a) the correlation between orthogonality scores and model capability and (b) the correlation between orthogonality and item-level semantic features (e.g., length, polarity). If any modest covariance is detected, we will note its magnitude and re-estimate the reliability prediction after residualizing orthogonality on those covariates. revision: partial
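
The residualization step promised in response 2 amounts to a simple regression. A minimal sketch follows, assuming a single covariate per instrument and ordinary least squares. The helper names and the six data rows are made-up placeholders, not values from the paper.

```python
from statistics import mean


def ols(x, y):
    """Slope and intercept of the least-squares line y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx


def residualize(y, x):
    """Part of y that is not linearly explained by x."""
    slope, intercept = ols(x, y)
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Placeholder values for six hypothetical instruments (illustration only).
orthogonality = [0.0, 0.2, 0.5, 0.5, 0.8, 1.0]
mean_item_length = [12, 9, 15, 11, 14, 10]   # covariate, e.g. words per item
reliability = [0.62, 0.55, 0.71, 0.48, 0.80, 0.66]

adjusted = residualize(orthogonality, mean_item_length)
print("raw association:     ", round(pearson(orthogonality, reliability), 3))
print("adjusted association:", round(pearson(adjusted, reliability), 3))
```

If the adjusted association stays close to the raw one, the covariate is not doing the predictive work. A large drop would support the referee's concern about circularity.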

Circularity Check

0 steps flagged

No significant circularity: empirical variance decomposition on administered instruments

Full rationale

The paper's claims rest on direct administration of personality/risk instruments to 56 LLMs and human samples, followed by variance decomposition that partitions observed response variation into trait-targeted vs. directional bias components. The 81-90% bias attribution is an output of this decomposition applied to collected data, not a quantity defined in terms of itself or recovered by fitting parameters to the target result. 'Response orthogonality' is coined as a descriptive proportion of opposing items and then shown to empirically predict reliability; this is a post-hoc correlation on the data, not a self-definitional loop. No self-citation chains, uniqueness theorems, or ansatzes imported from prior author work appear in the derivation. The analysis is self-contained against external benchmarks of instrument administration and standard psychometric variance partitioning.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper contributes empirical measurements across 56 models and a new analytic concept (response orthogonality) rather than relying on many free parameters or unverified entities; the central analysis rests on the applicability of human psychometric tools to LLMs.

axioms (1)
  • domain assumption: Psychometric instruments designed for humans can be meaningfully administered to LLMs to decompose responses into trait and bias components.
    The entire analysis and variance decomposition rest on this separation being valid for non-human responders.
invented entities (1)
  • response orthogonality (no independent evidence)
    purpose: Quantifies the proportion of items for which trait and bias point in opposite directions to predict apparent reliability
    Newly coined term introduced to explain why some instruments appear more reliable than others for LLMs.

pith-pipeline@v0.9.1-grok · 5809 in / 1465 out tokens · 35382 ms · 2026-06-26T17:31:30.802845+00:00 · methodology
