
arxiv: 2605.30058 · v1 · pith:PEDG4SC5new · submitted 2026-05-28 · 💻 cs.CL

HEART-Bench: Do LLM Agents Exhibit Human-like Psychology?

Pith reviewed 2026-06-29 07:16 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM agents · Big Five personality · episodic memories · DIAMONDS taxonomy · personality consistency · decision-making scenarios · psychology benchmark

The pith

HEART-Bench tests whether LLM agents make decisions consistent with assigned Big Five personality traits and 1,000 episodic memories across 64 scenarios.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HEART-Bench to check if LLM agents can exhibit coherent human-like psychology by maintaining personality consistency and value-aligned choices. It builds 11 characters from orthogonal Big Five traits deeply integrated with structured autobiographical memories spanning developmental stages. Agents face 64 decision-making scenarios drawn from the DIAMONDS taxonomy, which covers eight situational dimensions including Duty, Intellect, Adversity, and Sociality. The resulting 673 validated multiple-choice questions measure whether choices align with each character's specific profile. A sympathetic reader would care because this supplies a concrete way to evaluate psychological simulation beyond isolated task performance.

Core claim

The benchmark constructs 11 diverse human characters grounded in orthogonal Big Five personality traits, with each profile integrated with 1,000 structured autobiographical-style episodic memories distributed across theory-grounded developmental life stages. It subjects agents to 64 decision-making scenarios guided by the DIAMONDS taxonomy and evaluates whether agents consolidate their traits and memories to produce behavioral decisions consistent with their psychological profiles, yielding 673 multiple-choice questions after human validation.
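To make the pieces named in the core claim concrete, the sketch below models a character, a memory, and a question as plain Python records and checks one question against one character. It is illustrative only: the class names, field names, and the 0.0 to 1.0 trait scale are assumptions made for this example, not the paper's released schema.

```python
from dataclasses import dataclass, field

BIG_FIVE = ("O", "C", "E", "A", "N")
DIAMONDS = ("Duty", "Intellect", "Adversity", "Mating",
            "Positivity", "Negativity", "Deception", "Sociality")


@dataclass
class EpisodicMemory:
    life_stage: str   # hypothetical label for a developmental stage
    narrative: str    # autobiographical-style text


@dataclass
class Character:
    key: str
    big_five: dict    # trait letter -> score; the 0.0-1.0 scale is assumed
    memories: list = field(default_factory=list)


@dataclass
class MCQ:
    scenario_id: str
    character_key: str
    dimension: str    # dominant DIAMONDS dimension of the scenario
    options: list
    answer_index: int  # index of the profile-consistent option


def validate(character, mcq):
    """Return a list of problems; an empty list means the pair is well-formed."""
    problems = []
    if set(character.big_five) != set(BIG_FIVE):
        problems.append("big_five must have exactly the keys O, C, E, A, N")
    if any(not 0.0 <= v <= 1.0 for v in character.big_five.values()):
        problems.append("trait scores must lie in [0, 1]")
    if mcq.character_key != character.key:
        problems.append("question targets a different character")
    if mcq.dimension not in DIAMONDS:
        problems.append("unknown DIAMONDS dimension")
    if not 0 <= mcq.answer_index < len(mcq.options):
        problems.append("answer_index out of range")
    return problems
```

Under this layout, a full benchmark instance would hold 11 `Character` records with 1,000 `EpisodicMemory` entries each and 673 `MCQ` records spread over the 64 scenarios.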

What carries the argument

The DIAMONDS taxonomy of eight situational dimensions used to curate the 64 scenarios that probe whether decisions remain consistent with pre-assigned personality profiles and episodic memories.

If this is right

  • Agents can be evaluated for the ability to consolidate personality traits with autobiographical memories when facing varied situations.
  • The method supplies a scalable way to study value-consistent behavioral decision-making in LLM agents.
  • Human-validated multiple-choice questions allow repeatable measurement of personality consistency and emotional dimensions.
  • The benchmark treats emotional dimensions as equal in importance to task-oriented abilities such as planning and reasoning.
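The repeatable measurement mentioned in the list above could be as simple as a per-character consistency rate. The following is a minimal sketch under that assumption; the record layout (character key, chosen option, profile-consistent option) is hypothetical and the paper may score answers differently.

```python
from collections import defaultdict


def consistency_by_character(records):
    """Fraction of profile-consistent answers per character.

    records: iterable of (character_key, chosen_index, consistent_index).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for key, chosen, expected in records:
        totals[key] += 1
        hits[key] += int(chosen == expected)
    return {key: hits[key] / totals[key] for key in totals}
```

Reporting the rate per character rather than one pooled number makes it visible when an agent tracks some profiles well and others poorly.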

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same construction could be reused to track whether larger models maintain profile consistency better than smaller ones.
  • Conflicts deliberately introduced between memories and traits could test whether agents resolve value tensions in human-like ways.
  • Results might inform whether current alignment techniques produce stable psychological profiles or only surface-level mimicry.

Load-bearing premise

That consistency between an agent's choices in the scenarios and its pre-assigned personality profile plus memories counts as evidence of coherent human-like psychology.

What would settle it

Observing that agents assigned different personality profiles produce statistically similar choice patterns across the 64 scenarios would show the benchmark fails to detect profile-specific psychology.
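One way to operationalize "statistically similar choice patterns" is a chi-square test of homogeneity on a profile-by-option count table for a single question, with each agent sampled repeatedly. The sketch below computes only the statistic and its degrees of freedom in plain Python; the table layout and the choice of test are assumptions of this example, not a procedure taken from the paper.

```python
def chi_square_homogeneity(table):
    """Chi-square statistic and degrees of freedom for a count table.

    table[i][j] = number of times profile i picked option j.
    Assumes every row and every column has at least one count.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if all profiles shared one choice distribution.
            expected = row_totals[i] * col_totals[j] / grand
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof
```

A statistic near zero means the profiles are indistinguishable on that question, which is the failure case described above. A p-value can be read from the chi-square distribution with `dof` degrees of freedom, for instance with a library routine such as `scipy.stats.chi2_contingency`.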

Original abstract

While LLM agents have demonstrated remarkable task-oriented abilities such as planning, reasoning, and action, few works have treated them as complete human personalities where emotional dimensions hold equal importance. In this paper, we introduce a novel benchmark to systematically assess whether LLM agents can simulate coherent, human-like psychology. Specifically, our benchmark constructs 11 diverse human characters grounded in orthogonal Big Five personality traits, with each profile deeply integrated with 1,000 structured autobiographical-style episodic memories distributed across theory-grounded developmental life stages. To rigorously evaluate the psychological manifestations of LLMs, we designed a curated suite of 64 decision-making scenarios, guided by the DIAMONDS taxonomy, a psychological framework that characterizes situations along eight dimensions: Duty, Intellect, Adversity, Mating, pOsitivity, Negativity, Deception, and Sociality. By subjecting agents to varying scenarios, the benchmark evaluates whether they can consolidate their innate personality traits and autobiographical memories to make behavioral decisions that are consistent with their specific psychological profiles. After systematic human validation and filtering, we obtained a benchmark consisting of 673 multiple-choice questions (MCQs). We believe this benchmark provides a principled and scalable testbed for studying human-like emotions, personality consistency, and value-consistent behavioural decision-making in LLM-based agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces HEART-Bench, a benchmark consisting of 11 characters grounded in orthogonal Big Five personality traits integrated with 1,000 episodic memories each, evaluated via 64 DIAMONDS scenarios to produce 673 MCQs after human validation. The central claim is that this provides a principled testbed for whether LLM agents can consolidate personality traits and memories into behaviorally consistent decisions that reflect human-like psychology.

Significance. If the benchmark can be shown to isolate psychological simulation from instruction-following, it would offer a scalable resource for studying personality consistency and value-based decisions in LLM agents. The current design, however, does not yet demonstrate this isolation, limiting the significance of the contribution to the field.

major comments (2)
  1. [Abstract, evaluation design paragraph] The claim that consistency between agent choices on the 64 DIAMONDS scenarios and the supplied Big Five profiles plus 1,000 episodic memories constitutes evidence of coherent, human-like psychology lacks any described controls (e.g., profile ablation, conflicting memories, or comparison against human response distributions) that would distinguish this from simple prompt parroting.
  2. [Abstract] The manuscript describes benchmark construction and human validation but supplies no empirical results on LLM performance, no error analysis, and no evidence that the MCQs actually elicit or measure the claimed psychological consistency.
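The profile-ablation control named in the first comment can be sketched as a comparison of accuracy across prompting conditions. The condition names (`full`, `no_profile`, `swapped_profile`) and the input layout are hypothetical, chosen only to illustrate the idea; nothing here is reported by the paper.

```python
def accuracy(chosen, expected):
    """Share of questions where the chosen option matches the expected one."""
    return sum(c == e for c, e in zip(chosen, expected)) / len(expected)


def ablation_report(expected, runs):
    """Compare each condition against the full-profile run.

    expected: profile-consistent option index per question.
    runs: condition name -> chosen option indices, aligned with `expected`.
          Must contain the key "full" (profile and memories supplied).
    """
    scores = {name: accuracy(chosen, expected) for name, chosen in runs.items()}
    baseline = scores["full"]
    return {
        name: {"accuracy": score, "drop_vs_full": baseline - score}
        for name, score in scores.items()
    }
```

If accuracy barely drops when the profile is removed or swapped for another character's, the questions are answerable without the persona and cannot be measuring profile-specific behavior. A large drop is necessary but not sufficient evidence, since it does not by itself separate psychological simulation from following the prompt.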

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the benchmark design and scope. We respond to each major comment below and indicate planned revisions to address the concerns while preserving the paper's focus on benchmark construction and validation.

Point-by-point responses
  1. Referee: [Abstract, evaluation design paragraph] The claim that consistency between agent choices on the 64 DIAMONDS scenarios and the supplied Big Five profiles plus 1,000 episodic memories constitutes evidence of coherent, human-like psychology lacks any described controls (e.g., profile ablation, conflicting memories, or comparison against human response distributions) that would distinguish this from simple prompt parroting.

    Authors: We agree that the current text overstates the evidential value of consistency without controls. The benchmark design uses orthogonal traits and 1,000 memories per character to increase the difficulty of pure prompt parroting, but this does not substitute for explicit ablations. In revision we will (1) soften the abstract and evaluation paragraphs to describe the benchmark as enabling tests of psychological consistency rather than directly constituting evidence, (2) add a limitations subsection that explicitly discusses the lack of profile ablation, conflicting-memory, and human-distribution controls, and (3) outline planned follow-up experiments using those controls. revision: yes

  2. Referee: [Abstract] The manuscript describes benchmark construction and human validation but supplies no empirical results on LLM performance, no error analysis, and no evidence that the MCQs actually elicit or measure the claimed psychological consistency.

    Authors: The manuscript's primary contribution is the construction and human validation of HEART-Bench; no LLM evaluations or error analyses are reported because the work stops at releasing the testbed. We acknowledge that including baseline LLM results would help demonstrate the benchmark's utility. In the revised version we will add a short experimental section reporting performance of 2-3 representative LLMs on the 673 MCQs together with a basic error analysis, while clearly labeling these results as preliminary. revision: yes

Circularity Check

0 steps flagged

No significant circularity; externally constructed benchmark with no fitted predictions or self-referential derivations

Full rationale

The paper introduces HEART-Bench by constructing 11 character profiles from orthogonal Big Five traits, integrating 1,000 author-generated episodic memories per profile, and curating 64 DIAMONDS-guided scenarios into 673 MCQs. Evaluation measures behavioral consistency with the supplied profiles and memories. This construction is explicit and external to any prior fitted parameters or equations in the paper. No derivation chain exists that reduces a claimed prediction to its inputs by construction (e.g., no parameter fitting followed by a renamed prediction, no uniqueness theorem that rests on a self-citation, no ansatz smuggled in via citation). The central claim concerns the benchmark's utility as a testbed, not the derivation of a result equivalent to its inputs. Self-citations, if present, are not load-bearing for the benchmark design itself. This matches the default expectation of no circularity for a benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the validity of two established psychological frameworks and the untested assumption that MCQ consistency equals human-like psychology. No free parameters are introduced. The benchmark itself is the primary invented entity but carries no independent evidence beyond the paper's construction.

axioms (2)
  • [domain assumption] Big Five personality traits are orthogonal and suitable for constructing distinct human characters.
    Invoked to ground the 11 profiles in the abstract.
  • [domain assumption] The DIAMONDS taxonomy accurately characterizes situations for eliciting psychological responses.
    Used to guide the 64 decision-making scenarios.
invented entities (1)
  • HEART-Bench (no independent evidence)
    Purpose: testbed for LLM psychological consistency.
    The benchmark is the paper's main contribution.

pith-pipeline@v0.9.1-grok · 5781 in / 1419 out tokens · 34721 ms · 2026-06-29T07:16:35.138416+00:00 · methodology

