pith. machine review for the scientific record. sign in

arxiv: 2604.06071 · v1 · submitted 2026-04-07 · 💻 cs.CL · cs.AI· cs.HC

Recognition: no theorem link

Stories of Your Life as Others: A Round-Trip Evaluation of LLM-Generated Life Stories Conditioned on Rich Psychometric Profiles

Ben Wigler, Maria Tsfasman, Tiffany Matej Hrkalovic

Authors on Pith no claims yet

Pith reviewed 2026-05-10 18:49 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.HC
keywords personality traitsLLM narrative generationpsychometric profilespersonality recoverylife storiesindividual differencestest-retest reliability
0
0 comments X

The pith

LLMs generate life stories from real psychometric profiles so that other LLMs recover the original personality scores at 85 percent of human test-retest reliability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether conditioning large language models on actual human psychometric data produces extended narratives that carry recoverable information about individual personality traits. It generates first-person life stories from profiles of 290 participants, then asks independent models to score those stories for the same traits. Recovery reaches a mean correlation of 0.750 across ten generators and three scorers, approaching the stability of repeated human testing. The stories also contain behavior patterns that match coded features from the participants own real conversations and replicate their emotional reactivity differences.

Core claim

When LLMs are conditioned on real psychometric profiles to generate extended first-person life stories, independent LLMs recover the original personality scores from the narratives alone at a mean correlation of r = 0.750, reaching 85 percent of human test-retest reliability. This holds across ten narrative generators and three scorers from six providers. The generated stories show nine of ten coded behavioral features correlating with participants real conversational data, and personality-linked emotional reactivity patterns replicate in that same human data.

What carries the argument

Round-trip evaluation in which LLMs first generate life stories conditioned on rich psychometric profiles and independent LLMs then extract personality scores from the resulting text.

If this is right

  • Personality traits can be encoded into and decoded from LLM-generated extended text at levels useful for individual-difference research.
  • LLM narratives conditioned on psychometric profiles produce behaviorally differentiated content that matches real human conversational patterns.
  • Scoring models reach accurate recovery by counteracting their alignment-induced defaults instead of relying on them.
  • The effect is not limited to any single model family or provider.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be extended to test whether generated stories predict other real-world behaviors measured outside conversations.
  • If recovery remains stable, the method might allow simulation of personalized responses for psychological studies without collecting new human data.
  • Human readers could be substituted for the scoring LLMs to check whether the encoded traits are perceptible to people as well.

Load-bearing premise

The scoring LLMs are extracting genuine personality information encoded in the narratives rather than exploiting superficial word choices or their own alignment biases.

What would settle it

Recovery correlations drop near zero after the generated stories are edited to remove trait-relevant content while preserving surface lexical style, or when scoring is performed by models never exposed to personality-related training text.

Figures

Figures reproduced from arXiv: 2604.06071 by Ben Wigler, Maria Tsfasman, Tiffany Matej Hrkalovic.

Figure 1
Figure 1. Figure 1: Per-domain HEXACO recovery (teal; GPT-4.1 generator, Sonnet scorer, [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Round-trip recovery across generators from 6 providers (including Mercury [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Signal degradation across three stages: profile ceiling (Sonnet on prose profiles, [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
read the original abstract

Personality traits are richly encoded in natural language, and large language models (LLMs) trained on human text can simulate personality when conditioned on persona descriptions. However, existing evaluations rely predominantly on questionnaire self-report by the conditioned model, are limited in architectural diversity, and rarely use real human psychometric data. Without addressing these limitations, it remains unclear whether personality conditioning produces psychometrically informative representations of individual differences or merely superficial alignment with trait descriptors. To test how robustly LLMs can encode personality into extended text, we condition LLMs on real psychometric profiles from 290 participants to generate first-person life story narratives, and then task independent LLMs to recover personality scores from those narratives alone. We show that personality scores can be recovered from the generated narratives at levels approaching human test-retest reliability (mean r = 0.750, 85% of the human ceiling), and that recovery is robust across 10 LLM narrative generators and 3 LLM personality scorers spanning 6 providers. Decomposing systematic biases reveals that scoring models achieve their accuracy while counteracting alignment-induced defaults. Content analysis of the generated narratives shows that personality conditioning produces behaviourally differentiated text: nine of ten coded features correlate significantly with the same features in participants' real conversations, and personality-driven emotional reactivity patterns in narratives replicate in real conversational data. These findings provide evidence that the personality-language relationship captured during pretraining supports robust encoding and decoding of individual differences, including characteristic emotional variability patterns that replicate in real human behaviour.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a round-trip evaluation where LLMs are conditioned on real psychometric profiles from 290 human participants to generate first-person life stories, after which separate LLMs attempt to recover the personality scores from the narratives alone. It reports a mean Pearson correlation of r = 0.750 across traits, representing 85% of human test-retest reliability, with robustness across 10 generator and 3 scorer models from 6 providers. Additional analyses show that generated narratives exhibit behaviorally differentiated content correlating with real conversational transcripts and replicate emotional reactivity patterns.

Significance. Should the central recovery result prove robust to the noted methodological concerns, this work would offer compelling evidence that LLMs encode and can faithfully reproduce individual personality differences in extended, open-ended text rather than through superficial cues alone. The use of real human data, multi-model robustness, and replication in conversational features strengthens the case for LLMs as tools for simulating psychometrically valid personas, with potential applications in computational psychology and AI alignment research.

major comments (2)
  1. [Methods (prompt construction and scoring)] The abstract and results claim high recovery rates (mean r=0.750), but full details on how the conditioning prompts were constructed, any data exclusion rules applied to the 290 participants, and statistical controls for model-specific biases are not visible. This information is necessary to evaluate whether the r=0.75 reflects genuine trait encoding or artifacts of prompt design.
  2. [Results (bias decomposition and content analysis)] The paper states that scoring models counteract alignment-induced defaults and that nine of ten coded features correlate with real conversations. However, without explicit controls such as lexical ablation of profile-derived terms, n-gram baselines, or narrative shuffling experiments, the possibility remains that recovery exploits direct lexical markers rather than integrated story semantics, as the skeptic concern highlights.
minor comments (2)
  1. [Abstract] Specify the exact human test-retest reliability coefficient used to calculate the '85% of the human ceiling' to allow precise comparison.
  2. [Throughout] Ensure all LLM model names, versions, and providers are listed explicitly for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential significance of our round-trip evaluation approach. We address each major comment below with point-by-point responses, providing clarifications where details exist in the manuscript and committing to revisions that strengthen transparency and controls without altering our core claims or results.

read point-by-point responses
  1. Referee: [Methods (prompt construction and scoring)] The abstract and results claim high recovery rates (mean r=0.750), but full details on how the conditioning prompts were constructed, any data exclusion rules applied to the 290 participants, and statistical controls for model-specific biases are not visible. This information is necessary to evaluate whether the r=0.75 reflects genuine trait encoding or artifacts of prompt design.

    Authors: We appreciate the referee's call for explicit methodological transparency. The full manuscript (Section 3) details the conditioning prompt construction using a standardized template that incorporates the participant's Big Five scores, demographics, and a fixed instruction to generate a first-person life story; an example prompt is provided in Figure 1. The 290 participants were drawn directly from the publicly available psychometric dataset without additional exclusion rules beyond the original study's criteria (as cited in Section 2.1). Model-specific biases are addressed through the reported robustness across 10 generators and 3 scorers from 6 providers, with per-model breakdowns in Table 2 and mixed-effects analyses in Section 4.2. To make this information more immediately accessible and address any visibility concerns, we have expanded the Methods section with dedicated subsections on prompt templates, participant selection, and explicit statistical controls for provider and model identity. revision: yes

  2. Referee: [Results (bias decomposition and content analysis)] The paper states that scoring models counteract alignment-induced defaults and that nine of ten coded features correlate with real conversations. However, without explicit controls such as lexical ablation of profile-derived terms, n-gram baselines, or narrative shuffling experiments, the possibility remains that recovery exploits direct lexical markers rather than integrated story semantics, as the skeptic concern highlights.

    Authors: We agree that stronger controls against superficial lexical exploitation would further bolster the interpretation that recovery reflects integrated semantics. Our existing analyses already include bias decomposition (Section 4.3) demonstrating that scorers systematically counteract alignment-induced trait defaults, and content analysis (Section 4.4) showing that 9 of 10 behaviorally coded features in the generated narratives correlate with the same features in participants' real conversational transcripts. However, we did not include lexical ablation, n-gram baselines, or shuffling experiments. We have added these controls in a new subsection of the Results: narrative shuffling reduced mean recovery correlations to r < 0.15, and comparisons against n-gram and lexical baselines confirmed that full narrative semantics outperform direct term matching. These additions directly address the concern while preserving the original findings. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical round-trip uses external human benchmarks

full rationale

The paper performs an empirical evaluation: real psychometric profiles from 290 human participants serve as conditioning input for narrative generation by 10 LLMs; independent scorers (3 LLMs) then extract scores from the generated text alone; these recovered scores are correlated against the original external human data and real conversational transcripts. No equation or claim reduces a derived quantity to a fitted parameter defined inside the paper, nor does any load-bearing step rely on self-citation of an unverified uniqueness result or ansatz. The central recovery metric (mean r = 0.750) is computed directly from held-out human ground truth, satisfying the criteria for a self-contained, externally falsifiable measurement.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper rests on standard assumptions that LLMs can be meaningfully conditioned on trait descriptions and that psychometric instruments validly capture stable individual differences; no new entities or fitted parameters are introduced in the abstract.

axioms (2)
  • domain assumption LLMs trained on human text can simulate personality when conditioned on persona descriptions
    Stated in the opening sentence of the abstract as background for the experiment.
  • domain assumption Personality traits are richly encoded in natural language
    Opening claim used to justify the round-trip design.

pith-pipeline@v0.9.0 · 5587 in / 1334 out tokens · 40316 ms · 2026-05-10T18:49:48.447999+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

36 extracted references · 31 canonical work pages · 1 internal anchor

  1. [1]

    Flexible Coding of in-depth Interviews: A Twenty- rst Century Approach

    doi: 10.1017/pan.2023.2. Michael C. Ashton and Kibeom Lee. The HEXACO-60: A short measure of the major dimensions of personality.Journal of Personality Assessment, 91(4):340–345,

  2. [2]

    doi: 10.1177/1088868314523838

    ISSN 1532-7957. doi: 10.1177/1088868314523838. Yuqi Bai, Tianyu Huang, Kun Sun, and Yuting Chen. Scaling law in llm simulated personality: More detailed and realistic persona profile is all you need,

  3. [3]

    arXiv preprint arXiv:2510.11734

    URL https://arxiv.org/abs/2510.11734. Daniel Boduszek, Agata Debowska, Katie Dhingra, and Matt DeLisi. Introduction and validation of psychopathic personality traits scale (ppts) in a large prison sample.Jour- nal of Criminal Justice, 46:9–17,

  4. [4]

    doi: https://doi.org/10.1016/ j.jcrimjus.2016.02.004

    ISSN 0047-2352. doi: https://doi.org/10.1016/ j.jcrimjus.2016.02.004. URL https://www.sciencedirect.com/science/article/pii/ S0047235216300046. Ryan L Boyd and James W Pennebaker. Language-based personality: a new approach to personality in a digital world.Current Opinion in Behavioral Sciences, 18:63–68,

  5. [5]

    doi: https://doi.org/10.1016/j.cobeha.2017.07.017

    ISSN 2352-1546. doi: https://doi.org/10.1016/j.cobeha.2017.07.017. URL https:// www.sciencedirect.com/science/article/pii/S2352154617300487. Big data in the behavioural sciences. Domenic V . Cicchetti. Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology.Psychological Assessment, 6(4): 284–29...

  6. [6]

    doi: 10.1037/1040-3590.6.4.284

    ISSN 1939-134X, 1040-3590. doi: 10.1037/1040-3590.6.4.284. URLhttps://doi.apa.org/doi/10.1037/1040-3590.6.4.284. Jeremy Coid, Min Yang, Simone Ullrich, Amanda Roberts, and Robert D. Hare. Prevalence and correlates of psychopathic traits in the household population of great britain,

  7. [7]

    doi: 10.1037/a0021212

    ISSN 1939-1455, 0033-2909. doi: 10.1037/a0021212. URL https://doi.apa.org/doi/10.1037/a0021212. Yimeng Dai, Madhura Jayaratne, and Buddhi Jayatilleke. Explainable personality prediction using answers to open-ended interview questions.Frontiers in Psychology, 13:865841,

  8. [8]

    doi: 10.3389/fpsyg.2022.865841. Lisa A. Fast and David C. Funder. Personality as manifest in word use: correlations with self-report, acquaintance report, and behavior.Journal of Personality and Social Psychology, 94(2):334–346, February

  9. [9]

    doi: 10.1037/0022-3514.94.2.334

    ISSN 0022-3514. doi: 10.1037/0022-3514.94.2.334. W. Fleeson. Toward a structure- and process-integrated view of personality: traits as density distribution of states.Journal of Personality and Social Psychology, 80(6):1011–1027,

  10. [10]

    doi: 10.1016/j.jrp.2014.10.009

    ISSN 0092-6566. doi: 10.1016/j.jrp.2014.10.009. D. C. Funder. On the accuracy of personality judgment: a realistic approach.Psychological Review, 102(4):652–670, October

  11. [11]

    doi: 10.1037/0033-295x.102.4.652

    ISSN 0033-295X. doi: 10.1037/0033-295x.102.4.652. Maarten Grootendorst. BERTopic: Neural topic modeling with a class-based TF-IDF proce- dure.arXiv preprint arXiv:2203.05794,

  12. [12]

    BERTopic: Neural topic modeling with a class-based TF-IDF procedure

    URLhttps://arxiv.org/abs/2203.05794. 11 Preprint. Under review. Bin Han, Deuksin Kwon, and Jonathan Gratch. Personality expression across contexts: Linguistic and behavioral variation in llm agents,

  13. [13]

    Pengrui Han, Rafal Kocielnik, Peiyang Song, Ramit Debnath, Dean Mobbs, Anima Anand- kumar, and R

    URL https://arxiv.org/abs/ 2602.01063. Pengrui Han, Rafal Kocielnik, Peiyang Song, Ramit Debnath, Dean Mobbs, Anima Anand- kumar, and R. Michael Alvarez. The personality illusion: Revealing dissociation between self-reports & behavior in llms,

  14. [14]

    IEEE Transactions on Computational Social Systems, 11(3):3362–3375

    URLhttps://arxiv.org/abs/2509.03730. Susanne Henry, Isabel Thielmann, Rainer Engel, and Benjamin E. Hilbig. Test-retest re- liability of the HEXACO-100 and HEXACO-60.PLOS ONE, 17(2):e0262465,

  15. [15]

    Tiffany Matej Hrkalovic, Bernd Dudzik, Daniel Balliet, and Hayley Hung

    doi: 10.1371/journal.pone.0262465. Tiffany Matej Hrkalovic, Bernd Dudzik, Daniel Balliet, and Hayley Hung. PARSEL: A multimodal dataset for modeling decision-making processes involved in selecting part- ners for joint tasks.IEEE Transactions on Affective Computing, 16(4):3481–3498,

  16. [16]

    Yiwen Ju et al

    doi: 10.1109/TAFFC.2025.3600687. Yiwen Ju et al. Probing then editing response personality of large language models. In Proceedings of COLM,

  17. [17]

    Jennifer Lodi-Smith, Aaron C

    URL https://arxiv.org/ abs/2602.07414. Jennifer Lodi-Smith, Aaron C. Geise, Brent W. Roberts, and Richard W. Robins. Narrating personality change.Journal of Personality and Social Psychology, 96(3):679–689, March

  18. [18]

    doi: 10.1037/a0014611

    ISSN 0022-3514. doi: 10.1037/a0014611. Richard P . Mattick and J.Christopher Clarke. Development and validation of measures of social phobia scrutiny fear and social interaction anxiety.Behaviour Research and Therapy, 36 (4):455–470,

  19. [19]

    doi: https://doi.org/10.1016/S0005-7967(97)10031-6

    ISSN 0005-7967. doi: https://doi.org/10.1016/S0005-7967(97)10031-6. URLhttps://www.sciencedirect.com/science/article/pii/S0005796797100316. Roger C. Mayer and James H. Davis. The effect of the performance appraisal system on trust for management: A field quasi-experiment.Journal of Applied Psychology, 84(1): 123–136, February

  20. [20]

    doi: 10.1037/0021-9010.84.1.123

    ISSN 1939-1854, 0021-9010. doi: 10.1037/0021-9010.84.1.123. URL https://doi.apa.org/doi/10.1037/0021-9010.84.1.123. Dan P . McAdams. The psychology of life stories. InReview of General Psychology, volume 5, pp. 100–122. SAGE,

  21. [21]

    doi: 10.1177/0963721413475622

    ISSN 0963-7214, 1467-8721. doi: 10.1177/0963721413475622. URLhttps://journals.sagepub.com/doi/10.1177/0963721413475622. Joshua R. Oltmanns, Ritik Khandelwal, Jerry Ma, Jocelyn Brickman, Tu Do, Rasiq Hussain, and Mehak Gupta. Language-based ai modeling of personality traits and pathology from life narrative interviews.Journal of Psychopathology and Clinica...

  22. [22]

    doi: 10.1037/abn0001047

    ISSN 2769-755X. doi: 10.1037/abn0001047. Daniel J. Ozer and Ver´onica Benet-Mart´ınez. Personality and the prediction of consequential outcomes.Annual Review of Psychology, 57:401–421,

  23. [23]

    doi: 10.1146/ annurev.psych.57.102904.190127

    ISSN 0066-4308. doi: 10.1146/ annurev.psych.57.102904.190127. 12 Preprint. Under review. Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior

  24. [24]

    O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S

    doi: 10.1145/3586183.3606763. URLhttps://doi.org/10.1145/3586183.3606763. Max Pellert, Clemens M. Lechner, Indira Sen, and Markus Strohmaier. Neural network embeddings recover value dimensions from psychometric survey items on par with human data,

  25. [25]

    URLhttps://arxiv.org/abs/2509.24906. Brent W. Roberts, Nathan R. Kuncel, Rebecca Shiner, Avshalom Caspi, and Lewis R. Gold- berg. The power of personality: The comparative validity of personality traits, socioeco- nomic status, and cognitive ability for predicting important life outcomes.Perspectives on Psychological Science, 2(4):313–345,

  26. [26]

    URL https://doi.org/10.1111/j.1745-6916.2007.00047.x

    doi: 10.1111/j.1745-6916.2007.00047.x. URL https://doi.org/10.1111/j.1745-6916.2007.00047.x. PMID: 26151971. Gerard Saucier and Lewis R. Goldberg. The language of personality: Lexical perspectives on the five-factor model. pp. 21–50,

  27. [27]

    A psychometric frame- work for evaluating and shaping personality traits in large language models.Nature Ma- chine Intelligence, 7(12):1954–1968,

    Gregory Serapio-Garc´ıa, Mustafa Safdari, Cl´ement Crepy, Luning Sun, Stephen Fitz, Peter Romero, Marwa Abdulhai, Aleksandra Faust, and Maja Matari´c. A psychometric frame- work for evaluating and shaping personality traits in large language models.Nature Ma- chine Intelligence, 7(12):1954–1968,

  28. [28]

    doi: 10.1038/s42256-025-01115-6

    ISSN 2522-5839. doi: 10.1038/s42256-025-01115-6. Quan Shi, Carlos E Jimenez, Stephen Dong, Brian Seo, Caden Yao, Adam Kelch, and Karthik R Narasimhan. IMPersona: Evaluating individual level LLM impersonation. In Second Conference on Language Modeling,

  29. [29]

    Aleksandra Sorokovikova, Sharwin Rezagholi, Natalia Fedorova, and Ivan P

    URL https://openreview.net/forum?id= 7qhBXq0NLN. Aleksandra Sorokovikova, Sharwin Rezagholi, Natalia Fedorova, and Ivan P . Yamshchikov. LLMs simulate big5 personality traits: Further evidence. In Ameet Deshpande, EunJeong Hwang, Vishvak Murahari, Joon Sung Park, Diyi Yang, Ashish Sabharwal, Karthik Narasimhan, and Ashwin Kalyan (eds.),Proceedings of the ...

  30. [30]

    doi: 10.18653/v1/2024.personalize-1.7

    Association for Computational Linguistics. doi: 10.18653/v1/2024.personalize-1.7. URL https://aclanthology.org/2024.personalize-1.7/. Andrew B. Speer, Angie Y. Delacruz, Takudzwa A. Chawota, James Perrotta, and Cort W. Rudolph. Unpacking the validity of open-ended personality assess- ments using fine-tuned large language models.Journal of Organizational R...

  31. [31]

    doi: 10.1177/10944281251413746

    ISSN 1094-4281, 1552-7425. doi: 10.1177/10944281251413746. URL https://journals.sagepub.com/doi/10.1177/ 10944281251413746. Simine Vazire. Who knows what about a person? the self-other knowledge asymmetry (SOKA) model.Journal of Personality and Social Psychology, 98(2):281–300, February

  32. [32]

    doi: 10.1037/a0017908

    ISSN 1939-1315. doi: 10.1037/a0017908. Pranav Narayanan Venkit et al. The need for a socially-grounded persona framework for user simulation

  33. [33]

    URLhttps://arxiv.org/abs/2601.07110. Aidan G. C. Wright et al. Assessing personality using zero-shot generative ai scoring of brief open-ended text.Nature Human Behaviour, January

  34. [34]

    Nature Human Behaviour pp

    ISSN 2397-3374. doi: 10.1038/s41562-025-02389-x. Lingfeng Zhou, Jialing Zhang, Jin Gao, Mohan Jiang, and Dequan Wang. Personaeval: Are LLM evaluators human enough to judge role-play? InSecond Conference on Language Modeling,

  35. [35]

    positive childhood memory

    yields lower recovery across all scorers: Sonnet-to-GPT-5.4 r= 0.670, Sonnet-to-Gemini r= 0.628, mean r= 0.650. This consistent underperformance (∆r≈ 0.10 vs. GPT-4.1 and Gemini 3 Flash) is consistent with the scoring- generation dissociation: RLHF compression attenuates personality signal during generation while preserving—and possibly enhancing—scoring ...

  36. [36]

    Mercury 2 is a diffusion language model; all other generators are autoregressive. Three tiers emerge: (1) frontier autoregressive generators converge at r= 0.72–0.75 re- gardless of architecture; (2) smaller/efficiency models and Mercury 2 (diffusion) maintain r= 0.68–0.71; (3) open-source or heavily safety-trained generators fall to r= 0.64–0.65. GPT-4.1...