pith. sign in

arxiv: 2606.12730 · v1 · pith:NELB2PSCnew · submitted 2026-06-10 · 💻 cs.AI · cs.CL· cs.CY· cs.LG

Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior

Pith reviewed 2026-06-27 09:34 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.CYcs.LG
keywords LLM evaluationself-report coherenceTheory of Planned BehaviorBig Fiveconversational contextpersona promptingbehavior prediction
0
0 comments X

The pith

TPB self-reports predict LLM behavior at human levels in shared chats unlike Big 5

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether self-report probes can anticipate LLM behavioral tendencies for safe deployment. It contrasts broad Big 5 personality traits, which predict specific behaviors weakly even in humans, with the Theory of Planned Behavior that targets specific intentions and predicts human behavior better. Experiments across four behavioral tasks and 11 frontier models, while varying session context and identity induction, show that coherence between self-reports and behavior is selective. Within a shared conversation TPB reaches human-level coherence while Big 5 does not; across separate conversations coherence survives only for behaviors anchored outside the immediate prompt and collapses when behavior is strongly primed by context. Persona prompting makes self-reports more consistent but does not align behavior. The findings indicate that coarse personality frameworks may not be the best tools for testing deployment behavior and that more task-specific instruments evaluated across contexts are needed.

Core claim

Within a shared conversation the Theory of Planned Behavior reaches human-level coherence with actual LLM behavior while Big 5 does not. Across separate conversations coherence survives only for behaviors anchored outside the immediate prompt such as implicit bias shaped by training and collapses when behavior is strongly primed by context such as sycophancy. Persona prompting makes self-reports more consistent across conversations but does not bring behavior into alignment.

What carries the argument

Theory of Planned Behavior instrument measuring intention targeted to a specific behavior, contrasted with Big 5 broad traits, evaluated by varying shared versus separate conversational sessions and presence of identity induction.

If this is right

  • Coarse personality frameworks such as Big 5 may not be the best tools for testing deployment behavior.
  • Task- and behavior-specific instruments are needed for psychometric evaluation of LLMs.
  • Evaluations of self-report coherence must be conducted across tasks and contexts rather than in isolated sessions.
  • Persona prompting increases self-report consistency without ensuring behavioral alignment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • LLMs may retain some stable traits from training that can be detected with anchored intention questions but remain highly sensitive to immediate conversational priming.
  • Evaluation protocols for safe deployment should routinely include both shared-conversation and separate-conversation conditions to distinguish stable from context-dependent coherence.
  • Future work could test whether anchoring questions to training-derived behaviors improves prediction reliability across model families.

Load-bearing premise

The chosen behavioral tasks and self-report instruments validly isolate the effects of conversation context and identity induction without confounding from LLM training data or prompt artifacts.

What would settle it

Observing that TPB self-reports fail to correlate with measured behavior even within shared conversations across the four tasks, or that Big 5 self-reports show human-level coherence under the same conditions.

Figures

Figures reproduced from arXiv: 2606.12730 by Anima Anandkumar, Dean Mobbs, Myrl G. Marmarelis, Peiyang Song, Pengrui Han, Rafal Kocielnik, Ramit Debnath, R. Michael Alvarez.

Figure 1
Figure 1. Figure 1: Experimental framework for analyzing self-report–behavior coherence in LLMs. We investigate (RQ1) whether self-reports predict behavior under ideal conditions, then relax along three axes: (RQ2) fine-grained TPB → coarse-grained Big Five; (RQ3) within-session → between-session; (RQ4) parameter grid → persona grid. Evaluation spans 4 behavioral tasks and 11 LLMs. for the measurement-level case: existing stu… view at source ↗
Figure 2
Figure 2. Figure 2: RQ1: Theory of Planned Behavior (TPB) self-report substantially predicts behavior under same-session conditions, with task-specific patterns matching TPB’s theoretical scope. Within-model Pearson correlations between TPB self-report and behavior; grid induction, ∼54 observations per cell. (A) Per-task Fisher-z mean r for Intention (grey) vs. theoretically-primary construct (orange); IAT inversion is theore… view at source ↗
Figure 3
Figure 3. Figure 3: RQ2: TPB substantially outperforms Big Five in predicting LLM behavior under within-session, parameter-grid conditions. (A1, A2) Per-task Fisher-z aggregated mean raligned (sign-corrected to theoretical direction) for TPB constructs (orange) and Big Five traits (blue). (B) Per-model TPB vs. Big Five comparison; models sorted by TPB raligned descending. (C) Big Five per-cell within-model raligned heatmap. T… view at source ↗
Figure 4
Figure 4. Figure 4: RQ3: Context separation collapses TPB-behavior coherence for most models, with Sycophancy showing the sharpest collapse. All panels: TPB constructs, grid perturbation only. (A1, A2) Per-task Fisher-z mean raligned for TPB Intention (light) and theoretically-primary construct (dark), in same-session (orange) vs. separate-sessions (teal). (B) Per-model TPB raligned pooled across task×construct cells; same-se… view at source ↗
Figure 5
Figure 5. Figure 5: RQ4: persona induction does not rescue separate-sessions coherence, despite produc￾ing more diverse and more stable SR profiles. All panels: TPB constructs, separate-sessions only. (A1, A2) Per-task Fisher-z mean raligned for TPB Intention (light) and theoretically-primary construct (dark), under grid (teal) vs. persona (purple) induction. (B) Per-model paired bars: separate-sessions grid (teal) vs. person… view at source ↗
Figure 6
Figure 6. Figure 6: Big Five self-report fingerprints across 11 models (within-session). Each panel shows one model’s mean Likert score (1–5 scale) across the five Big Five traits under parameter-grid (blue) and persona (red) inductions. Shaded regions: ±1 SD across conditions. 1 2 3 4 5 6 7 Claude Haiku 4.5 Claude 3.7 Sonnet GPT-4o Mini Gemini 2.5 Flash DeepSeek V3.1 LLaMA-4 Maverick Att. SN PBC Int. 1 2 3 4 5 6 7 LLaMA-3.3 … view at source ↗
Figure 7
Figure 7. Figure 7: TPB self-report fingerprints, parameter-grid induction (within-session). Each panel: one model. X-axis: 4 TPB constructs (Att = Attitude, SN = Subjective Norm, PBC = Perceived Behavioral Control, Int = Intention). Lines coloured by behavioral task. Shaded: ±1 SD across conditions. Sycophancy levels swing dramatically between session types. Comparing the Sycophancy col￾umn across Figures 9 and 10 reveals th… view at source ↗
Figure 8
Figure 8. Figure 8: TPB self-report fingerprints, persona induction (within-session). Same structure as [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Behavioral fingerprints under same-session probing. Each panel: one model’s mean behavioral score across the 5 dimensions (Risk Taking, Sycophancy, Epistemic Honesty, Self￾Reflective Honesty, Stereotyping), mapped to a 1–5 common scale ( [PITH_FULL_IMAGE:figures/full_fig_p020_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Behavioral fingerprints under separate-sessions probing. Same structure as [PITH_FULL_IMAGE:figures/full_fig_p021_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Full session × induction interaction for TPB theoretically primary construct (CCT→PBC, Sycophancy→Subjective Norm, Honesty→Attitude, IAT→Intention). Four bars per task: within-grid (dark orange), within-personas (light orange), between-grid (dark teal), between￾personas (light teal). Error bars are 95% Fisher-z CIs. Cross-reference: main-text [PITH_FULL_IMAGE:figures/full_fig_p030_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: RQ4 prerequisites. (A) Discriminability: mean within-model SR-SD across conditions per framework–construct, parameter grid (dark) vs. persona grounding (light); pooled across tasks, within-session only. Stars indicate paired-Wilcoxon p < .05 / < .01 / < .001 across models. (B) Stability: Fisher-z aggregated r(SRwithin, SRbetween) across matched conditions, with 95% CI. Persona induction produces significa… view at source ↗
Figure 13
Figure 13. Figure 13: Between-model SR–behavior scatter plots: TPB constructs, parameter-grid induction (n = 11 models per panel). Each dot is one model; axes are model-mean z-standardised self-report (x) and model-mean z-standardised behavior (y). Regression slope βˆ between (Mundlak pooled OLS, cluster-robust SEs) is annotated in each panel and matches [PITH_FULL_IMAGE:figures/full_fig_p033_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Between-model SR–behavior scatter plots: Big Five traits, parameter-grid induction (n = 11 models per panel). Layout mirrors [PITH_FULL_IMAGE:figures/full_fig_p034_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Between-model SR–behavior scatter plots: TPB constructs, persona induction (n = 11 models per panel). Same layout and estimand as [PITH_FULL_IMAGE:figures/full_fig_p035_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Between-model SR–behavior scatter plots: Big Five traits, persona induction (n = 11 models per panel). Same layout and shading convention as [PITH_FULL_IMAGE:figures/full_fig_p036_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Within-model SR–behavior scatter plots: TPB constructs, parameter-grid induction (n ≈ 594 demeaned observations per panel). Both axes demeaned by per-model mean; r is Fisher-z transformed. Circle = Policy A, triangle = Policy B. Blue-shaded panels mark the theoretically￾primary TPB construct per task ( [PITH_FULL_IMAGE:figures/full_fig_p037_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Within-model SR–behavior scatter plots: Big Five traits, parameter-grid induction (n ≈ 297–378 demeaned observations per panel). Both axes demeaned by per-model mean. The annotated statistic is raligned = signtheory × r(trait, raw_beh), matching the paper’s Big Five sign convention exactly. All panels display near-zero raligned, confirming the main-text RQ2 claim that Big Five fails to predict behavior at… view at source ↗
Figure 19
Figure 19. Figure 19: Within-model SR–behavior scatter plots: TPB constructs, persona induction (n ≈ 660 demeaned observations per panel). Same layout, demeaning, and blue-shading convention as [PITH_FULL_IMAGE:figures/full_fig_p039_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: Within-model SR–behavior scatter plots: Big Five traits, persona induction (n ≈ 297–378 demeaned observations per panel). Same layout as [PITH_FULL_IMAGE:figures/full_fig_p040_20.png] view at source ↗
Figure 21
Figure 21. Figure 21: TPB. Three rows per task: between-session behavior (grey, top), within-session behavior split by Policy A vs. B (middle), SR Attitude split by Policy A vs. B (bottom). Right-side gutter reports |d| Cohen’s d and raw policy means in native units (Likert 1–7 for SR; task-native units for Behavior). Headline ratio RSR/Beh classifies tasks into the priming regime (R<1, Sycophancy and CCT) vs. the dispositiona… view at source ↗
Figure 22
Figure 22. Figure 22: Big Five. Three rows per task: between-session behavior (grey, top); within-session behavior (single dark distribution, since Big Five SR is policy-agnostic); SR for the two theoretically￾motivated Big Five traits per task (bottom). Trait z-scores are sign-flipped so that high-trait-along-its￾theory-direction matches Policy A direction on the behavior rows above, allowing within-column visual comparison a… view at source ↗
read the original abstract

Anticipating LLM behavioral tendencies from low-cost psychometric probes is critical for safe deployment, but only if self-reports (SR) reliably predict behavior. Recent work documented substantial SR-behavior dissociation in LLMs, but relied on broad personality traits (Big 5) that predict specific behaviors weakly, even in humans. Furthermore, the isolation of conversational sessions combined with weak context matching left open whether LLMs truly lack coherence or whether the conditions needed to detect such coherence were not met. We contrast Big 5 with the Theory of Planned Behavior (TPB), which measures intention targeted to a specific behavior and predicts human behavior substantially better than broad traits. We run experiments across four behavioral tasks and 11 frontier LLMs, while also varying session context and identity induction. We find that SR-behavior coherence exists but is selective. 1) Within a shared conversation, the Theory of Planned Behavior reaches human-level coherence; Big 5 does not. 2) Across separate conversations, coherence survives only for behaviors anchored outside the immediate prompt, such as implicit bias shaped by training, and collapses when behavior is strongly primed by context, as with sycophancy. 3) Persona prompting makes self-reports more consistent across conversations, but does not bring behavior into alignment. These findings suggest that coarse personality frameworks, such as Big 5 may not be the best tools for testing deployment behavior. More task- and behavior-specific instruments are needed, and even these must be evaluated across tasks and contexts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript examines when self-reports from large language models (LLMs) reliably predict their behavioral tendencies. It contrasts broad Big Five personality traits, which show weak prediction even in humans, with the Theory of Planned Behavior (TPB) that targets specific intentions. Experiments across four behavioral tasks and 11 frontier LLMs, manipulating session context and identity induction, reveal selective SR-behavior coherence: TPB achieves human-level coherence within shared conversations unlike Big 5; coherence persists for out-of-prompt behaviors but collapses for context-primed ones like sycophancy; persona prompting improves SR consistency but not behavioral alignment. The work concludes that task-specific instruments are preferable for psychometric evaluation of LLMs.

Significance. If the findings hold, this paper makes a significant contribution by demonstrating that psychometric evaluation of LLMs requires more nuanced, behavior-specific tools rather than off-the-shelf personality inventories. It provides empirical evidence for the conditions under which self-reports can be trusted, with direct relevance to safe AI deployment. The selective nature of coherence and the role of conversational context are important insights that challenge assumptions in current LLM evaluation practices.

major comments (2)
  1. [Abstract] Abstract: The assertion that TPB 'reaches human-level coherence' (while Big 5 does not) is not anchored by a matched human experiment on the identical four behavioral tasks, the same TPB items, or the same context/identity manipulations. Without this direct comparison, the quantitative threshold for 'human-level' is undefined and the selectivity claim cannot be evaluated against the invoked human benchmark.
  2. [Methods / Experiments] The description of the four behavioral tasks and TPB/Big 5 instruments (likely in the Methods or §4) provides no indication of exact item wording, task definitions, sample sizes, or statistical tests. This prevents verification that the reported patterns isolate conversational context effects without confounding from LLM training data or prompt artifacts, which is load-bearing for the dissociation and coherence claims.
minor comments (2)
  1. Add error bars or confidence intervals to all figures reporting coherence metrics across LLMs and conditions.
  2. [Discussion] Clarify in the text whether the human TPB literature cited uses comparable behavioral anchors or relies on self-reported behavior.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback. We address each major comment below and will make the indicated revisions to improve clarity and verifiability.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The assertion that TPB 'reaches human-level coherence' (while Big 5 does not) is not anchored by a matched human experiment on the identical four behavioral tasks, the same TPB items, or the same context/identity manipulations. Without this direct comparison, the quantitative threshold for 'human-level' is undefined and the selectivity claim cannot be evaluated against the invoked human benchmark.

    Authors: We agree the abstract phrasing is imprecise. The 'human-level' reference is drawn from meta-analytic correlations in the existing human TPB literature rather than a new matched human experiment on our tasks. We will revise the abstract to explicitly cite the relevant human benchmarks (e.g., Armitage & Conner 2001) and state that the comparison is to published human values, not a concurrent study. This clarifies the quantitative basis without requiring new human data collection. revision: yes

  2. Referee: [Methods / Experiments] The description of the four behavioral tasks and TPB/Big 5 instruments (likely in the Methods or §4) provides no indication of exact item wording, task definitions, sample sizes, or statistical tests. This prevents verification that the reported patterns isolate conversational context effects without confounding from LLM training data or prompt artifacts, which is load-bearing for the dissociation and coherence claims.

    Authors: We will expand the Methods section (and any supplementary materials) to provide the complete item wording for all TPB and Big 5 measures, exact task definitions and prompts, sample sizes per condition and model, and the full statistical procedures including tests, effect sizes, and corrections. These additions will allow direct verification that the observed patterns are attributable to the manipulated factors. revision: yes

Circularity Check

0 steps flagged

Empirical study with no derivation chain or self-citation circularity

full rationale

This is an empirical experimental study that runs LLMs on behavioral tasks and compares self-report instruments (Big 5 vs. TPB) under varying context and identity conditions. No equations, fitted parameters, or derivations are present that could reduce reported findings to inputs by construction. The central claims rest on direct experimental measurements rather than any of the enumerated circular patterns (self-definitional, fitted-input-as-prediction, or load-bearing self-citation chains). The absence of a matched human baseline on identical tasks is a potential validity concern but does not constitute circularity in the derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on the domain assumption that human-validated psychometric instruments transfer meaningfully to LLMs and that the experimental manipulations isolate coherence. No free parameters or invented entities are introduced.

axioms (1)
  • domain assumption The Theory of Planned Behavior measures intention targeted to a specific behavior and predicts human behavior substantially better than broad traits.
    Invoked in the abstract to justify the contrast with Big 5.

pith-pipeline@v0.9.1-grok · 5840 in / 1300 out tokens · 33379 ms · 2026-06-27T09:34:59.178845+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

81 extracted references · 10 canonical work pages

  1. [1]

    Performance of a large language model on the reasoning tasks of a physician.Science, 392(6797):524–527, 2026

    Peter G Brodeur, Thomas A Buckley, Zahir Kanjee, Ethan Goh, Evelyn Bin Ling, Priyank Jain, Stephanie Cabral, Raja-Elie Abdulnour, Adrian D Haimovich, Jason A Freed, et al. Performance of a large language model on the reasoning tasks of a physician.Science, 392(6797):524–527, 2026

  2. [2]

    Takehiro Takayanagi, Kiyoshi Izumi, Javier Sanz-Cruzado, Richard McCreadie, and Iadh Ounis. Are generative AI agents effective personalized financial advisors? InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’25), 2025. doi: 10.1145/3726302.3729897

  3. [3]

    Exploring the potential of LLM to enhance teaching plans through teaching simulation.npj Science of Learning, 10:7, 2025

    Bing Hu, Junjie Zhu, Yu Pei, et al. Exploring the potential of LLM to enhance teaching plans through teaching simulation.npj Science of Learning, 10:7, 2025. doi: 10.1038/ s41539-025-00300-x

  4. [4]

    Ai impact on human proof formalization workflows

    Katherine M Collins, Simon Frieder, Jonas Bayer, Jacob Loader, Jeck Lim, Peiyang Song, Fabian Zaiser, Lexin Zhou, Shanda Li, Shi-Zhuo Looi, et al. Ai impact on human proof formalization workflows. InThe 5th Workshop on Mathematical Reasoning and AI at NeurIPS 2025

  5. [5]

    A psychometric framework for evaluating and shaping personality traits in large language models.Nature Machine Intelligence, pages 1–15, 2025

    Gregory Serapio-García, Mustafa Safdari, Clément Crepy, Luning Sun, Stephen Fitz, Peter Romero, Marwa Abdulhai, Aleksandra Faust, and Maja Matari´c. A psychometric framework for evaluating and shaping personality traits in large language models.Nature Machine Intelligence, pages 1–15, 2025

  6. [6]

    The self-report method.Handbook of research methods in personality psychology, 1(2007):224–239, 2007

    Delroy L Paulhus, Simine Vazire, et al. The self-report method.Handbook of research methods in personality psychology, 1(2007):224–239, 2007

  7. [7]

    The personality illusion: Revealing dissociation between self-reports & behavior in llms

    Pengrui Han, Rafal Dariusz Kocielnik, Peiyang Song, Ramit Debnath, Dean Mobbs, Anima Anandkumar, and R Michael Alvarez. The personality illusion: Revealing dissociation between self-reports & behavior in llms. InNeurIPS 2025 Workshop on Bridging Language, Agent, and World Models for Reasoning and Planning, 2025

  8. [8]

    Training language models to be warm can reduce accuracy and increase sycophancy.Nature, 652:1159–1165, 2026

    Lujain Ibrahim, Franziska Sofia Hafner, and Luc Rocher. Training language models to be warm can reduce accuracy and increase sycophancy.Nature, 652:1159–1165, 2026. doi: 10.1038/s41586-026-10410-0

  9. [9]

    Language models transmit behavioural traits through hidden signals in data.Nature, 652:615–621, 2026

    Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Sören Mindermann, Jacob Hilton, Samuel Marks, and Owain Evans. Language models transmit behavioural traits through hidden signals in data.Nature, 652:615–621, 2026. doi: 10.1038/s41586-026-10319-8

  10. [10]

    self-report

    Huiqi Zou, Pengda Wang, Zihan Yan, Tianjun Sun, and Ziang Xiao. Can LLM “self-report”? evaluating the validity of self-report scales in measuring personality design in LLM-based chatbots.arXiv preprint arXiv:2412.00207, 2024

  11. [11]

    Do psychometric tests work for large language models? evaluation of tests on sexism, racism, and morality.arXiv preprint arXiv:2510.11254, 2025

    Jana Jung, Marlene Lutz, Indira Sen, and Markus Strohmaier. Do psychometric tests work for large language models? evaluation of tests on sexism, racism, and morality.arXiv preprint arXiv:2510.11254, 2025

  12. [12]

    Toward a science of hu- man–AI teaming for decision making: A complemen- tarity framework

    Aadesh Salecha, Molly E. Ireland, Shashanka Subrahmanya, João Sedoc, Lyle H. Ungar, and Johannes C. Eichstaedt. Large language models display human-like social desirability biases in Big Five personality surveys.PNAS Nexus, 3(12):pgae533, 2024. doi: 10.1093/pnasnexus/ pgae533. 10

  13. [13]

    Dorner, Samira Samadi, and Augustin Kelava

    Tom Sühr, Florian E. Dorner, Samira Samadi, and Augustin Kelava. Challenging the validity of personality tests for large language models. InProceedings of the 5th ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization (EAAMO ’25), 2025. doi: 10.1145/3757887.3763016. Earlier version: arXiv:2311.05297

  14. [14]

    Randomness, not representation: The unreliability of evaluating cultural alignment in llms.arXiv preprint arXiv:2503.08688, 2025

    Ariba Khan, Stephen Casper, and Dylan Hadfield-Menell. Randomness, not representation: The unreliability of evaluating cultural alignment in llms.arXiv preprint arXiv:2503.08688, 2025

  15. [15]

    Questioning the survey responses of large language models.Advances in Neural Information Processing Systems, 37:45850–45878, 2024

    Ricardo Dominguez-Olmedo, Moritz Hardt, and Celestine Mendler-Dünner. Questioning the survey responses of large language models.Advances in Neural Information Processing Systems, 37:45850–45878, 2024

  16. [16]

    Self-assessment tests are unreliable measures of llm personality.arXiv preprint arXiv:2309.08163, 2023

    Akshat Gupta, Xiaoyang Song, and Gopala Anumanchipalli. Self-assessment tests are unreliable measures of llm personality.arXiv preprint arXiv:2309.08163, 2023

  17. [17]

    Armin Klaps, Zuzana Kovacovsky, Bernhard Landrichter, and Birgit U. Stetina. Human traits in artificial minds: Personality construction in contemporary LLMs.Research Square preprint,

  18. [18]

    doi: 10.21203/rs.3.rs-8210799/v1

  19. [19]

    Large language model reasoning failures

    Peiyang Song, Pengrui Han, and Noah Goodman. Large language model reasoning failures. arXiv preprint arXiv:2602.06176, 2026

  20. [20]

    Personallm: Investigating the ability of large language models to express personality traits.Findings of NAACL 2024, 2024

    Hang Jiang, Xiang Zhang, Xiyao Cao, Cynthia Breazeal, Deb Roy, and Jad Kabbara. Personallm: Investigating the ability of large language models to express personality traits.Findings of NAACL 2024, 2024. URLhttps://arxiv.org/abs/2305.02547

  21. [21]

    Lechner, Claudia Wagner, Beatrice Rammstedt, and Markus Strohmaier

    Max Pellert, Clemens M. Lechner, Claudia Wagner, Beatrice Rammstedt, and Markus Strohmaier. Ai psychometrics: Assessing the psychological profiles of large language models through psychometric inventories.Perspectives on Psychological Science, 19(5):808–826, 2024

  22. [22]

    Personality traits in large language models.arXiv preprint arXiv:2307.00184, 2023

    Greg Serapio-García, Mustafa Safdari, Clément Crepy, Luning Sun, Stephen Fitz, Peter Romero, Marwa Abdulhai, Aleksandra Faust, and Maja Matari ´c. Personality traits in large language models.arXiv preprint arXiv:2307.00184, 2023

  23. [23]

    Big5-chat: Shaping llm personalities through training on human-grounded data

    Wenkai Li, Jiarui Liu, Andy Liu, Xuhui Zhou, Mona Diab, and Maarten Sap. Big5-chat: Shaping llm personalities through training on human-grounded data. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 20434–20471, 2025

  24. [24]

    The big-five trait taxonomy: History, measurement, and theoretical perspectives

    Oliver John. The big-five trait taxonomy: History, measurement, and theoretical perspectives. Published as, 1999

  25. [25]

    Walter Mischel and Yuichi Shoda. A cognitive-affective system theory of personality: recon- ceptualizing situations, dispositions, dynamics, and invariance in personality structure.Psycho- logical review, 102(2):246, 1995

  26. [26]

    Wiley, New York, 1968

    Walter Mischel.Personality and Assessment. Wiley, New York, 1968

  27. [27]

    Hemphill

    James F. Hemphill. Interpreting the magnitudes of correlation coefficients.American Psycholo- gist, 58(1):78–79, 2003. doi: 10.1037/0003-066X.58.1.78

  28. [28]

    Situation trait relevance, trait expression, and cross- situational consistency: Testing a principle of trait activation.Journal of Research in Personality, 34(4):397–423, 2000

    Robert P Tett and Hal A Guterman. Situation trait relevance, trait expression, and cross- situational consistency: Testing a principle of trait activation.Journal of Research in Personality, 34(4):397–423, 2000

  29. [29]

    A personality trait-based interactionist model of job performance.Journal of Applied psychology, 88(3):500, 2003

    Robert P Tett and Dawn D Burnett. A personality trait-based interactionist model of job performance.Journal of Applied psychology, 88(3):500, 2003

  30. [30]

    Efficacy of the theory of planned behaviour: A meta-analytic review.British journal of social psychology, 40(4):471–499, 2001

    Christopher J Armitage and Mark Conner. Efficacy of the theory of planned behaviour: A meta-analytic review.British journal of social psychology, 40(4):471–499, 2001

  31. [31]

    Prospective prediction of health-related behaviours with the theory of planned be- haviour: A meta-analysis.Health psychology review, 5(2):97–144, 2011

    Rosemary Robin Charlotte McEachan, Mark Conner, Natalie Jayne Taylor, and Rebecca Jane Lawton. Prospective prediction of health-related behaviours with the theory of planned be- haviour: A meta-analysis.Health psychology review, 5(2):97–144, 2011. 11

  32. [32]

    Scaling synthetic data creation with 1,000,000,000 personas.arXiv preprint arXiv:2406.20094, 2024

    Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, and Dong Yu. Scaling synthetic data creation with 1,000,000,000 personas.arXiv preprint arXiv:2406.20094, 2024

  33. [33]

    Large language model agent in financial trading: A survey.arXiv preprint arXiv:2408.06361, 2024

    Han Ding, Yinheng Li, Junhao Wang, and Hang Chen. Large language model agent in financial trading: A survey.arXiv preprint arXiv:2408.06361, 2024

  34. [34]

    The theory of planned behavior.Organizational Behavior and Human Decision Processes, 50(2):179–211, 1991

    Icek Ajzen. The theory of planned behavior.Organizational Behavior and Human Decision Processes, 50(2):179–211, 1991

  35. [35]

    The unbearable automaticity of being.American psychologist, 54(7):462, 1999

    John A Bargh and Tanya L Chartrand. The unbearable automaticity of being.American psychologist, 54(7):462, 1999

  36. [36]

    Are large language models consistent over value-laden questions? InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 15185–15221, 2024

    Jared Moore, Tanvi Deshpande, and Diyi Yang. Are large language models consistent over value-laden questions? InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 15185–15221, 2024

  37. [37]

    Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. InAdvances in Neural Information Processing Systems, volume 36, pages 74952–74965, 2023

  38. [38]

    In-context learning may not elicit trustworthy reasoning: A-not-b errors in pretrained language models

    Pengrui Han, Peiyang Song, Haofei Yu, and Jiaxuan You. In-context learning may not elicit trustworthy reasoning: A-not-b errors in pretrained language models. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 5624–5643, 2024

  39. [39]

    Open University Press, Milton Keynes, 1988

    Icek Ajzen and Martin Fishbein.Attitudes, Personality and Behaviour. Open University Press, Milton Keynes, 1988

  40. [40]

    Affective and deliberative processes in risky choice: age differences in risk taking in the columbia card task

    Bernd Figner, Rachael J Mackinlay, Friedrich Wilkening, and Elke U Weber. Affective and deliberative processes in risky choice: age differences in risk taking in the columbia card task. Journal of Experimental Psychology: Learning, Memory, and Cognition, 35(3):709, 2009

  41. [41]

    Studies of independence and conformity: I

    Solomon E Asch. Studies of independence and conformity: I. a minority of one against a unanimous majority.Psychological monographs: General and applied, 70(9):1, 1956

  42. [42]

    Towards understanding sycophancy in language models.arXiv preprint arXiv:2310.13548, 2023

    Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R Johnston, et al. Towards understanding sycophancy in language models.arXiv preprint arXiv:2310.13548, 2023

  43. [43]

    Thomas O Nelson and Louis Narens. Norms of 300 general-information questions: Accuracy of recall, latency of recall, and feeling-of-knowing ratings.Journal of verbal learning and verbal behavior, 19(3):338–368, 1980

  44. [44]

    Alignment for honesty.Advances in Neural Information Processing Systems, 37:63565–63598, 2024

    Yuqing Yang, Ethan Chern, Xipeng Qiu, Graham Neubig, and Pengfei Liu. Alignment for honesty.Advances in Neural Information Processing Systems, 37:63565–63598, 2024

  45. [45]

    Measuring individual differences in implicit cognition: the implicit association test.Journal of personality and social psychology, 74(6):1464, 1998

    Anthony G Greenwald, Debbie E McGhee, and Jordan LK Schwartz. Measuring individual differences in implicit cognition: the implicit association test.Journal of personality and social psychology, 74(6):1464, 1998

  46. [46]

    Chatgpt based data augmentation for improved parameter-efficient debiasing of llms

    Pengrui Han, Rafal Kocielnik, Adhithya Saravanan, Roy Jiang, Or Sharir, and Anima Anand- kumar. Chatgpt based data augmentation for improved parameter-efficient debiasing of llms. arXiv preprint arXiv:2402.11764, 2024

  47. [47]

    Be- yond profile: From surface-level facts to deep persona simulation in LLMs.arXiv preprint arXiv:2502.12988, 2025

    Zixiao Wang, Duzhen Zhang, Ishita Agrawal, Shen Gao, Le Song, and Xiuying Chen. Be- yond profile: From surface-level facts to deep persona simulation in LLMs.arXiv preprint arXiv:2502.12988, 2025

  48. [48]

    Hedges and Ingram Olkin.Statistical Methods for Meta-Analysis

    Larry V . Hedges and Ingram Olkin.Statistical Methods for Meta-Analysis. Academic Press, Orlando, FL, 1985

  49. [49]

    Edwin B. Wilson. Probable inference, the law of succession, and statistical inference.Journal of the American Statistical Association, 22(158):209–212, 1927

  50. [50]

    On the pooling of time series and cross section data.Econometrica, 46(1):69–85,

    Yair Mundlak. On the pooling of time series and cross section data.Econometrica, 46(1):69–85,

  51. [51]

    doi: 10.2307/1913646. 12

  52. [52]

    A practitioner’s guide to cluster-robust inference

    A Colin Cameron and Douglas L Miller. A practitioner’s guide to cluster-robust inference. Journal of human resources, 50(2):317–372, 2015

  53. [53]

    A meta-analysis on the correlation between the implicit association test and explicit self-report measures.Personality and Social Psychology Bulletin, 31(10):1369–1385, 2005

    Wilhelm Hofmann, Bertram Gawronski, Tobias Gschwendner, Huy Le, and Manfred Schmitt. A meta-analysis on the correlation between the implicit association test and explicit self-report measures.Personality and Social Psychology Bulletin, 31(10):1369–1385, 2005

  54. [54]

    Oswald, Gregory Mitchell, Hart Blanton, James Jaccard, and Philip E

    Frederick L. Oswald, Gregory Mitchell, Hart Blanton, James Jaccard, and Philip E. Tetlock. Predicting ethnic and racial discrimination: A meta-analysis of IAT criterion studies.Journal of Personality and Social Psychology, 105(2):171–192, 2013

  55. [55]

    Big five inventory.Journal of personality and social psychology, 1991

    Oliver P John, Eileen M Donahue, and Robert L Kentle. Big five inventory.Journal of personality and social psychology, 1991

  56. [56]

    Funder and C

    David C. Funder and C. Randall Colvin. Explorations in behavioral consistency: Properties of persons, situations, and behaviors.Journal of Personality and Social Psychology, 60(5): 773–794, 1991. doi: 10.1037/0022-3514.60.5.773

  57. [57]

    Brent W Roberts, Nathan R Kuncel, Rebecca Shiner, Avshalom Caspi, and Lewis R Goldberg. The power of personality: The comparative validity of personality traits, socioeconomic status, and cognitive ability for predicting important life outcomes.Perspectives on Psychological science, 2(4):313–345, 2007

  58. [58]

    Adaptation of agentic ai.arXiv preprint arXiv:2512.16301, 2025

    Pengcheng Jiang, Jiacheng Lin, Zhiyi Shi, Zifeng Wang, Luxi He, Yichen Wu, Ming Zhong, Peiyang Song, Qizheng Zhang, Heng Wang, et al. Adaptation of agentic ai.arXiv preprint arXiv:2512.16301, 2025

  59. [59]

    Steer2adapt: Dynamically composing steering vectors elicits efficient adaptation of llms.arXiv preprint arXiv:2602.07276, 2026

    Pengrui Han, Xueqiang Xu, Keyang Xuan, Peiyang Song, Siru Ouyang, Runchu Tian, Yuqing Jiang, Cheng Qian, Pengcheng Jiang, Jiashuo Sun, et al. Steer2adapt: Dynamically composing steering vectors elicits efficient adaptation of llms.arXiv preprint arXiv:2602.07276, 2026

  60. [60]

    Defeating nondeterminism in LLM inference

    Thinking Machines. Defeating nondeterminism in LLM inference. Think- ing Machines blog post, 2025. URL https://thinkingmachines.ai/blog/ defeating-nondeterminism-in-llm-inference/

  61. [61]

    best- effort

    Lexin Zhou, Lorenzo Pacchiardi, Fernando Martínez-Plumed, Katherine M. Collins, Yael Moros-Daval, Seraphina Zhang, Qinlin Zhao, Yitian Huang, Luning Sun, Jonathan E. Prunty, et al. General scales unlock AI evaluation with explanatory and predictive power.Nature, 652: 58–67, 2026. doi: 10.1038/s41586-026-10303-2

  62. [62]

    Interactive evaluation requires a design science

    Keyang Xuan, Peiyang Song, Pan Lu, Pengrui Han, Wenkai Li, Zhenyu Zhang, Zexue He, Wenyue Hua, Manling Li, Jiaxuan You, et al. Interactive evaluation requires a design science. arXiv preprint arXiv:2605.17829, 2026

  63. [63]

    Measurement and fairness

    Abigail Z Jacobs and Hanna Wallach. Measurement and fairness. InProceedings of the 2021 ACM conference on fairness, accountability, and transparency, pages 375–385, 2021

  64. [64]

    Large language model psychometrics: A systematic review of evaluation, validation, and enhancement.arXiv preprint arXiv:2505.08245, 2025

    Haoran Ye, Jing Jin, Yuhang Xie, Xin Zhang, and Guojie Song. Large language model psychometrics: A systematic review of evaluation, validation, and enhancement.arXiv preprint arXiv:2505.08245, 2025

  65. [65]

    Where does personality have its influence? a supermatrix of consistency concepts.Journal of personality, 76(6):1355–1386, 2008

    William Fleeson and Erik E Noftle. Where does personality have its influence? a supermatrix of consistency concepts.Journal of personality, 76(6):1355–1386, 2008

  66. [66]

    Toward a structure-and process-integrated view of personality: Traits as density distributions of states.Journal of personality and social psychology, 80(6):1011, 2001

    William Fleeson. Toward a structure-and process-integrated view of personality: Traits as density distributions of states.Journal of personality and social psychology, 80(6):1011, 2001

  67. [67]

    Creative and context-aware translation of east asian idioms with gpt-4

    Kenan Tang, Peiyang Song, Yao Qin, and Xifeng Yan. Creative and context-aware translation of east asian idioms with gpt-4. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 9285–9305, 2024

  68. [68]

    Leibo, Alexander Sasha Vezhnevets, William A

    Joel Z. Leibo, Alexander Sasha Vezhnevets, William A. Cunningham, and Stanley M. Bileschi. A pragmatic view of AI personhood.arXiv preprint arXiv:2510.26396, 2025. 13

  69. [69]

    Elephant: Measuring and understanding social sycophancy in llms.arXiv preprint arXiv:2505.13995, 2025

    Myra Cheng, Sunny Yu, Cinoo Lee, Pranav Khadpe, Lujain Ibrahim, and Dan Jurafsky. Elephant: Measuring and understanding social sycophancy in llms.arXiv preprint arXiv:2505.13995, 2025

  70. [70]

    How and why llms generalize: A fine-grained analysis of llm reasoning from cognitive behaviors to low-level patterns.arXiv preprint arXiv:2512.24063, 2025

    Haoyue Bai, Yiyou Sun, Wenjie Hu, Shi Qiu, Maggie Ziyu Huan, Peiyang Song, Robert Nowak, and Dawn Song. How and why llms generalize: A fine-grained analysis of llm reasoning from cognitive behaviors to low-level patterns.arXiv preprint arXiv:2512.24063, 2025

  71. [71]

    Why do multi- agent llm systems fail?Advances in Neural Information Processing Systems, 38, 2026

    Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, et al. Why do multi- agent llm systems fail?Advances in Neural Information Processing Systems, 38, 2026

  72. [72]

    Exploring variability in risk taking with large language models.Journal of Experimental Psychology: General, 153(7):1838, 2024

    Sudeep Bhatia. Exploring variability in risk taking with large language models.Journal of Experimental Psychology: General, 153(7):1838, 2024

  73. [73]

    Shrout and Joseph L

    Patrick E. Shrout and Joseph L. Fleiss. Intraclass correlations: uses in assessing rater reliability. Psychological Bulletin, 86(2):420–428, 1979

  74. [74]

    Colin Cameron and Douglas L

    A. Colin Cameron and Douglas L. Miller. A practitioner’s guide to cluster-robust inference. Journal of Human Resources, 50(2):317–372, 2015. A Further Discussion This appendix expands on three subsidiary points referenced in the main Discussion: alternative explanations of cross-session collapse, methodological implications for prior validation studies, a...

  75. [75]

    behavior itself decorrelates across sessions

    100% C1=C2 consistency 1 + 4(x/100) 50%→3.0 (half consistent) More C1–C2 consis- tency † The plotted score increases with overconfidence (i.e., lower honesty). E.1 Big Five self-report fingerprints Figure 6 shows each model’s mean Big Five trait profile under both parameter-grid and persona inductions, in the within-session condition. Profiles are highly ...

  76. [76]

    +0.42∗ personas; between-session +0.44∗ grid vs

    Honesty–attitude within-model coupling is robust across induction types, attenuated between sessions.Within-session βwithin = +0.47 ∗ grid vs. +0.42∗ personas; between-session +0.44∗ grid vs. +0.30∗∗∗ personas. All four cells are significantly positive at the within-model level. Persona induction reduces the magnitude but not the sign

  77. [77]

    Neither session separation nor induction format moves this coefficient

    IAT–intention dissociation is the most stable signal in the dataset.All four cells have highly significant negative βwithin (−0.60 to −0.73, all p < .001 ). Neither session separation nor induction format moves this coefficient

  78. [78]

    does behavior change?

    Sycophancy’s within-session coherence is a between-model phenomenon under personas. Under personas within, the between-model component βbetween = +0.39 is significant (p=.007 ) while the within-model component is null (βwithin = +0.05, p=.89 ). Under grid within, both are marginal. This inverts under between-session, where all four sycophancy cells are nu...

  79. [79]

    Pool sampling.A pool of 500 English-language personas is sampled uniformly at random from the full dataset (seed 42; ASCII ratio ≥0.95 ; maximum 500 characters per persona to limit noise in the TF-IDF representation)

  80. [80]

    TF-IDF vectorisation.All pool personas are vectorised with a unigram + bigram TF-IDF representation (sublinear TF scaling;min_df= 1,max_df= 0.95)

Showing first 80 references.