arxiv: 2604.06071 · v1 · submitted 2026-04-07 · 💻 cs.CL · cs.AI· cs.HC

Recognition: no theorem link

Stories of Your Life as Others: A Round-Trip Evaluation of LLM-Generated Life Stories Conditioned on Rich Psychometric Profiles

Ben Wigler, Maria Tsfasman, Tiffany Matej Hrkalovic

Authors on Pith no claims yet

Pith reviewed 2026-05-10 18:49 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.HC

keywords personality traitsLLM narrative generationpsychometric profilespersonality recoverylife storiesindividual differencestest-retest reliability

0 comments

The pith

LLMs generate life stories from real psychometric profiles so that other LLMs recover the original personality scores at 85 percent of human test-retest reliability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether conditioning large language models on actual human psychometric data produces extended narratives that carry recoverable information about individual personality traits. It generates first-person life stories from profiles of 290 participants, then asks independent models to score those stories for the same traits. Recovery reaches a mean correlation of 0.750 across ten generators and three scorers, approaching the stability of repeated human testing. The stories also contain behavior patterns that match coded features from the participants own real conversations and replicate their emotional reactivity differences.

Core claim

When LLMs are conditioned on real psychometric profiles to generate extended first-person life stories, independent LLMs recover the original personality scores from the narratives alone at a mean correlation of r = 0.750, reaching 85 percent of human test-retest reliability. This holds across ten narrative generators and three scorers from six providers. The generated stories show nine of ten coded behavioral features correlating with participants real conversational data, and personality-linked emotional reactivity patterns replicate in that same human data.

What carries the argument

Round-trip evaluation in which LLMs first generate life stories conditioned on rich psychometric profiles and independent LLMs then extract personality scores from the resulting text.

If this is right

Personality traits can be encoded into and decoded from LLM-generated extended text at levels useful for individual-difference research.
LLM narratives conditioned on psychometric profiles produce behaviorally differentiated content that matches real human conversational patterns.
Scoring models reach accurate recovery by counteracting their alignment-induced defaults instead of relying on them.
The effect is not limited to any single model family or provider.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be extended to test whether generated stories predict other real-world behaviors measured outside conversations.
If recovery remains stable, the method might allow simulation of personalized responses for psychological studies without collecting new human data.
Human readers could be substituted for the scoring LLMs to check whether the encoded traits are perceptible to people as well.

Load-bearing premise

The scoring LLMs are extracting genuine personality information encoded in the narratives rather than exploiting superficial word choices or their own alignment biases.

What would settle it

Recovery correlations drop near zero after the generated stories are edited to remove trait-relevant content while preserving surface lexical style, or when scoring is performed by models never exposed to personality-related training text.

Figures

Figures reproduced from arXiv: 2604.06071 by Ben Wigler, Maria Tsfasman, Tiffany Matej Hrkalovic.

**Figure 2.** Figure 2: Round-trip recovery across generators from 6 providers (including Mercury [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Signal degradation across three stages: profile ceiling (Sonnet on prose profiles, [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

read the original abstract

Personality traits are richly encoded in natural language, and large language models (LLMs) trained on human text can simulate personality when conditioned on persona descriptions. However, existing evaluations rely predominantly on questionnaire self-report by the conditioned model, are limited in architectural diversity, and rarely use real human psychometric data. Without addressing these limitations, it remains unclear whether personality conditioning produces psychometrically informative representations of individual differences or merely superficial alignment with trait descriptors. To test how robustly LLMs can encode personality into extended text, we condition LLMs on real psychometric profiles from 290 participants to generate first-person life story narratives, and then task independent LLMs to recover personality scores from those narratives alone. We show that personality scores can be recovered from the generated narratives at levels approaching human test-retest reliability (mean r = 0.750, 85% of the human ceiling), and that recovery is robust across 10 LLM narrative generators and 3 LLM personality scorers spanning 6 providers. Decomposing systematic biases reveals that scoring models achieve their accuracy while counteracting alignment-induced defaults. Content analysis of the generated narratives shows that personality conditioning produces behaviourally differentiated text: nine of ten coded features correlate significantly with the same features in participants' real conversations, and personality-driven emotional reactivity patterns in narratives replicate in real conversational data. These findings provide evidence that the personality-language relationship captured during pretraining supports robust encoding and decoding of individual differences, including characteristic emotional variability patterns that replicate in real human behaviour.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper shows LLMs can generate life stories from real psychometric profiles and let other LLMs recover the original scores at r=0.75, close to human reliability, with some checks against actual conversation data.

read the letter

The main takeaway is that conditioning LLMs on actual personality scores from 290 real participants produces first-person narratives from which independent models recover those scores at mean r=0.75, hitting 85% of human test-retest levels, and the effect holds across 10 generators and 3 scorers from 6 providers. They also report that nine of ten coded behavioral features in the stories correlate with the same features in the participants' real conversations, and some emotional reactivity patterns replicate there too. That setup is the clearest step forward here. Most prior work uses synthetic personas and stops at self-report questionnaires from the model itself, so moving to real data plus a full round-trip plus external validation is genuinely new. The multi-model sweep and the bias decomposition that shows scorers counteracting alignment defaults are useful additions. The content analysis linking generated text back to real talk gives the claims more grounding than usual. The soft spot is the one flagged in the stress test. Because the generators receive the full profile as input, the stories likely contain direct lexical signals like trait adjectives or frequency shifts in emotional terms. Without reported ablations that remove those obvious markers, n-gram baselines, or shuffled-narrative controls, it is hard to know whether the scorers are reading integrated story content or just matching surface cues. The abstract mentions they handle alignment defaults, but that does not fully address lexical leakage from the conditioning step. Prompt construction details and exact exclusion rules are also not visible yet, which matters for reproducibility. This is worth reading for anyone working on personalized language models or computational models of individual differences. The quantitative results and real-data angle are concrete enough that a serious editor should send it to referees rather than desk-reject, even if the mechanism question needs tighter controls in revision. I would engage with it.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a round-trip evaluation where LLMs are conditioned on real psychometric profiles from 290 human participants to generate first-person life stories, after which separate LLMs attempt to recover the personality scores from the narratives alone. It reports a mean Pearson correlation of r = 0.750 across traits, representing 85% of human test-retest reliability, with robustness across 10 generator and 3 scorer models from 6 providers. Additional analyses show that generated narratives exhibit behaviorally differentiated content correlating with real conversational transcripts and replicate emotional reactivity patterns.

Significance. Should the central recovery result prove robust to the noted methodological concerns, this work would offer compelling evidence that LLMs encode and can faithfully reproduce individual personality differences in extended, open-ended text rather than through superficial cues alone. The use of real human data, multi-model robustness, and replication in conversational features strengthens the case for LLMs as tools for simulating psychometrically valid personas, with potential applications in computational psychology and AI alignment research.

major comments (2)

[Methods (prompt construction and scoring)] The abstract and results claim high recovery rates (mean r=0.750), but full details on how the conditioning prompts were constructed, any data exclusion rules applied to the 290 participants, and statistical controls for model-specific biases are not visible. This information is necessary to evaluate whether the r=0.75 reflects genuine trait encoding or artifacts of prompt design.
[Results (bias decomposition and content analysis)] The paper states that scoring models counteract alignment-induced defaults and that nine of ten coded features correlate with real conversations. However, without explicit controls such as lexical ablation of profile-derived terms, n-gram baselines, or narrative shuffling experiments, the possibility remains that recovery exploits direct lexical markers rather than integrated story semantics, as the skeptic concern highlights.

minor comments (2)

[Abstract] Specify the exact human test-retest reliability coefficient used to calculate the '85% of the human ceiling' to allow precise comparison.
[Throughout] Ensure all LLM model names, versions, and providers are listed explicitly for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential significance of our round-trip evaluation approach. We address each major comment below with point-by-point responses, providing clarifications where details exist in the manuscript and committing to revisions that strengthen transparency and controls without altering our core claims or results.

read point-by-point responses

Referee: [Methods (prompt construction and scoring)] The abstract and results claim high recovery rates (mean r=0.750), but full details on how the conditioning prompts were constructed, any data exclusion rules applied to the 290 participants, and statistical controls for model-specific biases are not visible. This information is necessary to evaluate whether the r=0.75 reflects genuine trait encoding or artifacts of prompt design.

Authors: We appreciate the referee's call for explicit methodological transparency. The full manuscript (Section 3) details the conditioning prompt construction using a standardized template that incorporates the participant's Big Five scores, demographics, and a fixed instruction to generate a first-person life story; an example prompt is provided in Figure 1. The 290 participants were drawn directly from the publicly available psychometric dataset without additional exclusion rules beyond the original study's criteria (as cited in Section 2.1). Model-specific biases are addressed through the reported robustness across 10 generators and 3 scorers from 6 providers, with per-model breakdowns in Table 2 and mixed-effects analyses in Section 4.2. To make this information more immediately accessible and address any visibility concerns, we have expanded the Methods section with dedicated subsections on prompt templates, participant selection, and explicit statistical controls for provider and model identity. revision: yes
Referee: [Results (bias decomposition and content analysis)] The paper states that scoring models counteract alignment-induced defaults and that nine of ten coded features correlate with real conversations. However, without explicit controls such as lexical ablation of profile-derived terms, n-gram baselines, or narrative shuffling experiments, the possibility remains that recovery exploits direct lexical markers rather than integrated story semantics, as the skeptic concern highlights.

Authors: We agree that stronger controls against superficial lexical exploitation would further bolster the interpretation that recovery reflects integrated semantics. Our existing analyses already include bias decomposition (Section 4.3) demonstrating that scorers systematically counteract alignment-induced trait defaults, and content analysis (Section 4.4) showing that 9 of 10 behaviorally coded features in the generated narratives correlate with the same features in participants' real conversational transcripts. However, we did not include lexical ablation, n-gram baselines, or shuffling experiments. We have added these controls in a new subsection of the Results: narrative shuffling reduced mean recovery correlations to r < 0.15, and comparisons against n-gram and lexical baselines confirmed that full narrative semantics outperform direct term matching. These additions directly address the concern while preserving the original findings. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical round-trip uses external human benchmarks

full rationale

The paper performs an empirical evaluation: real psychometric profiles from 290 human participants serve as conditioning input for narrative generation by 10 LLMs; independent scorers (3 LLMs) then extract scores from the generated text alone; these recovered scores are correlated against the original external human data and real conversational transcripts. No equation or claim reduces a derived quantity to a fitted parameter defined inside the paper, nor does any load-bearing step rely on self-citation of an unverified uniqueness result or ansatz. The central recovery metric (mean r = 0.750) is computed directly from held-out human ground truth, satisfying the criteria for a self-contained, externally falsifiable measurement.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper rests on standard assumptions that LLMs can be meaningfully conditioned on trait descriptions and that psychometric instruments validly capture stable individual differences; no new entities or fitted parameters are introduced in the abstract.

axioms (2)

domain assumption LLMs trained on human text can simulate personality when conditioned on persona descriptions
Stated in the opening sentence of the abstract as background for the experiment.
domain assumption Personality traits are richly encoded in natural language
Opening claim used to justify the round-trip design.

pith-pipeline@v0.9.0 · 5587 in / 1334 out tokens · 40316 ms · 2026-05-10T18:49:48.447999+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

36 extracted references · 31 canonical work pages · 1 internal anchor

[1]

Flexible Coding of in-depth Interviews: A Twenty- rst Century Approach

doi: 10.1017/pan.2023.2. Michael C. Ashton and Kibeom Lee. The HEXACO-60: A short measure of the major dimensions of personality.Journal of Personality Assessment, 91(4):340–345,

work page doi:10.1017/pan.2023.2 2023
[2]

doi: 10.1177/1088868314523838

ISSN 1532-7957. doi: 10.1177/1088868314523838. Yuqi Bai, Tianyu Huang, Kun Sun, and Yuting Chen. Scaling law in llm simulated personality: More detailed and realistic persona profile is all you need,

work page doi:10.1177/1088868314523838
[3]

arXiv preprint arXiv:2510.11734

URL https://arxiv.org/abs/2510.11734. Daniel Boduszek, Agata Debowska, Katie Dhingra, and Matt DeLisi. Introduction and validation of psychopathic personality traits scale (ppts) in a large prison sample.Jour- nal of Criminal Justice, 46:9–17,

work page arXiv
[4]

doi: https://doi.org/10.1016/ j.jcrimjus.2016.02.004

ISSN 0047-2352. doi: https://doi.org/10.1016/ j.jcrimjus.2016.02.004. URL https://www.sciencedirect.com/science/article/pii/ S0047235216300046. Ryan L Boyd and James W Pennebaker. Language-based personality: a new approach to personality in a digital world.Current Opinion in Behavioral Sciences, 18:63–68,

2016
[5]

doi: https://doi.org/10.1016/j.cobeha.2017.07.017

ISSN 2352-1546. doi: https://doi.org/10.1016/j.cobeha.2017.07.017. URL https:// www.sciencedirect.com/science/article/pii/S2352154617300487. Big data in the behavioural sciences. Domenic V . Cicchetti. Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology.Psychological Assessment, 6(4): 284–29...

work page doi:10.1016/j.cobeha.2017.07.017 2017
[6]

doi: 10.1037/1040-3590.6.4.284

ISSN 1939-134X, 1040-3590. doi: 10.1037/1040-3590.6.4.284. URLhttps://doi.apa.org/doi/10.1037/1040-3590.6.4.284. Jeremy Coid, Min Yang, Simone Ullrich, Amanda Roberts, and Robert D. Hare. Prevalence and correlates of psychopathic traits in the household population of great britain,

work page doi:10.1037/1040-3590.6.4.284 1939
[7]

doi: 10.1037/a0021212

ISSN 1939-1455, 0033-2909. doi: 10.1037/a0021212. URL https://doi.apa.org/doi/10.1037/a0021212. Yimeng Dai, Madhura Jayaratne, and Buddhi Jayatilleke. Explainable personality prediction using answers to open-ended interview questions.Frontiers in Psychology, 13:865841,

work page doi:10.1037/a0021212 1939
[8]

doi: 10.3389/fpsyg.2022.865841. Lisa A. Fast and David C. Funder. Personality as manifest in word use: correlations with self-report, acquaintance report, and behavior.Journal of Personality and Social Psychology, 94(2):334–346, February

work page doi:10.3389/fpsyg.2022.865841 2022
[9]

doi: 10.1037/0022-3514.94.2.334

ISSN 0022-3514. doi: 10.1037/0022-3514.94.2.334. W. Fleeson. Toward a structure- and process-integrated view of personality: traits as density distribution of states.Journal of Personality and Social Psychology, 80(6):1011–1027,

work page doi:10.1037/0022-3514.94.2.334
[10]

doi: 10.1016/j.jrp.2014.10.009

ISSN 0092-6566. doi: 10.1016/j.jrp.2014.10.009. D. C. Funder. On the accuracy of personality judgment: a realistic approach.Psychological Review, 102(4):652–670, October

work page doi:10.1016/j.jrp.2014.10.009 2014
[11]

doi: 10.1037/0033-295x.102.4.652

ISSN 0033-295X. doi: 10.1037/0033-295x.102.4.652. Maarten Grootendorst. BERTopic: Neural topic modeling with a class-based TF-IDF proce- dure.arXiv preprint arXiv:2203.05794,

work page doi:10.1037/0033-295x.102.4.652
[12]

BERTopic: Neural topic modeling with a class-based TF-IDF procedure

URLhttps://arxiv.org/abs/2203.05794. 11 Preprint. Under review. Bin Han, Deuksin Kwon, and Jonathan Gratch. Personality expression across contexts: Linguistic and behavioral variation in llm agents,

work page internal anchor Pith review arXiv
[13]

Pengrui Han, Rafal Kocielnik, Peiyang Song, Ramit Debnath, Dean Mobbs, Anima Anand- kumar, and R

URL https://arxiv.org/abs/ 2602.01063. Pengrui Han, Rafal Kocielnik, Peiyang Song, Ramit Debnath, Dean Mobbs, Anima Anand- kumar, and R. Michael Alvarez. The personality illusion: Revealing dissociation between self-reports & behavior in llms,

work page arXiv
[14]

IEEE Transactions on Computational Social Systems, 11(3):3362–3375

URLhttps://arxiv.org/abs/2509.03730. Susanne Henry, Isabel Thielmann, Rainer Engel, and Benjamin E. Hilbig. Test-retest re- liability of the HEXACO-100 and HEXACO-60.PLOS ONE, 17(2):e0262465,

work page arXiv
[15]

Tiffany Matej Hrkalovic, Bernd Dudzik, Daniel Balliet, and Hayley Hung

doi: 10.1371/journal.pone.0262465. Tiffany Matej Hrkalovic, Bernd Dudzik, Daniel Balliet, and Hayley Hung. PARSEL: A multimodal dataset for modeling decision-making processes involved in selecting part- ners for joint tasks.IEEE Transactions on Affective Computing, 16(4):3481–3498,

work page doi:10.1371/journal.pone.0262465
[16]

Yiwen Ju et al

doi: 10.1109/TAFFC.2025.3600687. Yiwen Ju et al. Probing then editing response personality of large language models. In Proceedings of COLM,

work page doi:10.1109/taffc.2025.3600687 2025
[17]

Jennifer Lodi-Smith, Aaron C

URL https://arxiv.org/ abs/2602.07414. Jennifer Lodi-Smith, Aaron C. Geise, Brent W. Roberts, and Richard W. Robins. Narrating personality change.Journal of Personality and Social Psychology, 96(3):679–689, March

work page arXiv
[18]

doi: 10.1037/a0014611

ISSN 0022-3514. doi: 10.1037/a0014611. Richard P . Mattick and J.Christopher Clarke. Development and validation of measures of social phobia scrutiny fear and social interaction anxiety.Behaviour Research and Therapy, 36 (4):455–470,

work page doi:10.1037/a0014611
[19]

doi: https://doi.org/10.1016/S0005-7967(97)10031-6

ISSN 0005-7967. doi: https://doi.org/10.1016/S0005-7967(97)10031-6. URLhttps://www.sciencedirect.com/science/article/pii/S0005796797100316. Roger C. Mayer and James H. Davis. The effect of the performance appraisal system on trust for management: A field quasi-experiment.Journal of Applied Psychology, 84(1): 123–136, February

work page doi:10.1016/s0005-7967(97)10031-6
[20]

doi: 10.1037/0021-9010.84.1.123

ISSN 1939-1854, 0021-9010. doi: 10.1037/0021-9010.84.1.123. URL https://doi.apa.org/doi/10.1037/0021-9010.84.1.123. Dan P . McAdams. The psychology of life stories. InReview of General Psychology, volume 5, pp. 100–122. SAGE,

work page doi:10.1037/0021-9010.84.1.123 1939
[21]

doi: 10.1177/0963721413475622

ISSN 0963-7214, 1467-8721. doi: 10.1177/0963721413475622. URLhttps://journals.sagepub.com/doi/10.1177/0963721413475622. Joshua R. Oltmanns, Ritik Khandelwal, Jerry Ma, Jocelyn Brickman, Tu Do, Rasiq Hussain, and Mehak Gupta. Language-based ai modeling of personality traits and pathology from life narrative interviews.Journal of Psychopathology and Clinica...

work page doi:10.1177/0963721413475622
[22]

doi: 10.1037/abn0001047

ISSN 2769-755X. doi: 10.1037/abn0001047. Daniel J. Ozer and Ver´onica Benet-Mart´ınez. Personality and the prediction of consequential outcomes.Annual Review of Psychology, 57:401–421,

work page doi:10.1037/abn0001047
[23]

doi: 10.1146/ annurev.psych.57.102904.190127

ISSN 0066-4308. doi: 10.1146/ annurev.psych.57.102904.190127. 12 Preprint. Under review. Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior

work page arXiv
[24]

O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S

doi: 10.1145/3586183.3606763. URLhttps://doi.org/10.1145/3586183.3606763. Max Pellert, Clemens M. Lechner, Indira Sen, and Markus Strohmaier. Neural network embeddings recover value dimensions from psychometric survey items on par with human data,

work page doi:10.1145/3586183.3606763
[25]

URLhttps://arxiv.org/abs/2509.24906. Brent W. Roberts, Nathan R. Kuncel, Rebecca Shiner, Avshalom Caspi, and Lewis R. Gold- berg. The power of personality: The comparative validity of personality traits, socioeco- nomic status, and cognitive ability for predicting important life outcomes.Perspectives on Psychological Science, 2(4):313–345,

work page arXiv
[26]

URL https://doi.org/10.1111/j.1745-6916.2007.00047.x

doi: 10.1111/j.1745-6916.2007.00047.x. URL https://doi.org/10.1111/j.1745-6916.2007.00047.x. PMID: 26151971. Gerard Saucier and Lewis R. Goldberg. The language of personality: Lexical perspectives on the five-factor model. pp. 21–50,

work page doi:10.1111/j.1745-6916.2007.00047.x 2007
[27]

A psychometric frame- work for evaluating and shaping personality traits in large language models.Nature Ma- chine Intelligence, 7(12):1954–1968,

Gregory Serapio-Garc´ıa, Mustafa Safdari, Cl´ement Crepy, Luning Sun, Stephen Fitz, Peter Romero, Marwa Abdulhai, Aleksandra Faust, and Maja Matari´c. A psychometric frame- work for evaluating and shaping personality traits in large language models.Nature Ma- chine Intelligence, 7(12):1954–1968,

1954
[28]

doi: 10.1038/s42256-025-01115-6

ISSN 2522-5839. doi: 10.1038/s42256-025-01115-6. Quan Shi, Carlos E Jimenez, Stephen Dong, Brian Seo, Caden Yao, Adam Kelch, and Karthik R Narasimhan. IMPersona: Evaluating individual level LLM impersonation. In Second Conference on Language Modeling,

work page doi:10.1038/s42256-025-01115-6
[29]

Aleksandra Sorokovikova, Sharwin Rezagholi, Natalia Fedorova, and Ivan P

URL https://openreview.net/forum?id= 7qhBXq0NLN. Aleksandra Sorokovikova, Sharwin Rezagholi, Natalia Fedorova, and Ivan P . Yamshchikov. LLMs simulate big5 personality traits: Further evidence. In Ameet Deshpande, EunJeong Hwang, Vishvak Murahari, Joon Sung Park, Diyi Yang, Ashish Sabharwal, Karthik Narasimhan, and Ashwin Kalyan (eds.),Proceedings of the ...

2024
[30]

doi: 10.18653/v1/2024.personalize-1.7

Association for Computational Linguistics. doi: 10.18653/v1/2024.personalize-1.7. URL https://aclanthology.org/2024.personalize-1.7/. Andrew B. Speer, Angie Y. Delacruz, Takudzwa A. Chawota, James Perrotta, and Cort W. Rudolph. Unpacking the validity of open-ended personality assess- ments using fine-tuned large language models.Journal of Organizational R...

work page doi:10.18653/v1/2024.personalize-1.7 2024
[31]

doi: 10.1177/10944281251413746

ISSN 1094-4281, 1552-7425. doi: 10.1177/10944281251413746. URL https://journals.sagepub.com/doi/10.1177/ 10944281251413746. Simine Vazire. Who knows what about a person? the self-other knowledge asymmetry (SOKA) model.Journal of Personality and Social Psychology, 98(2):281–300, February

work page doi:10.1177/10944281251413746
[32]

doi: 10.1037/a0017908

ISSN 1939-1315. doi: 10.1037/a0017908. Pranav Narayanan Venkit et al. The need for a socially-grounded persona framework for user simulation

work page doi:10.1037/a0017908 1939
[33]

URLhttps://arxiv.org/abs/2601.07110. Aidan G. C. Wright et al. Assessing personality using zero-shot generative ai scoring of brief open-ended text.Nature Human Behaviour, January

work page arXiv
[34]

Nature Human Behaviour pp

ISSN 2397-3374. doi: 10.1038/s41562-025-02389-x. Lingfeng Zhou, Jialing Zhang, Jin Gao, Mohan Jiang, and Dequan Wang. Personaeval: Are LLM evaluators human enough to judge role-play? InSecond Conference on Language Modeling,

work page doi:10.1038/s41562-025-02389-x
[35]

positive childhood memory

yields lower recovery across all scorers: Sonnet-to-GPT-5.4 r= 0.670, Sonnet-to-Gemini r= 0.628, mean r= 0.650. This consistent underperformance (∆r≈ 0.10 vs. GPT-4.1 and Gemini 3 Flash) is consistent with the scoring- generation dissociation: RLHF compression attenuates personality signal during generation while preserving—and possibly enhancing—scoring ...

2007
[36]

Mercury 2 is a diffusion language model; all other generators are autoregressive. Three tiers emerge: (1) frontier autoregressive generators converge at r= 0.72–0.75 re- gardless of architecture; (2) smaller/efficiency models and Mercury 2 (diffusion) maintain r= 0.68–0.71; (3) open-source or heavily safety-trained generators fall to r= 0.64–0.65. GPT-4.1...

1999