HEART-Bench: Do LLM Agents Exhibit Human-like Psychology?

Bowen Li; Chenxu Zhang; Chunliang Feng; Heng Lian; Jiahao Pang; Qianao Wang; Qihong Mao; Weihan Peng; Xiaodong Gu; Yuling Shi

arxiv: 2605.30058 · v1 · pith:PEDG4SC5new · submitted 2026-05-28 · 💻 cs.CL

Pith reviewed 2026-06-29 07:16 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLM agents · Big Five personality · episodic memories · DIAMONDS taxonomy · personality consistency · decision-making scenarios · psychology benchmark


The pith

HEART-Bench tests whether LLM agents make decisions consistent with assigned Big Five personality traits and 1,000 episodic memories across 64 scenarios.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HEART-Bench to check if LLM agents can exhibit coherent human-like psychology by maintaining personality consistency and value-aligned choices. It builds 11 characters from orthogonal Big Five traits deeply integrated with structured autobiographical memories spanning developmental stages. Agents face 64 decision-making scenarios drawn from the DIAMONDS taxonomy, which covers eight situational dimensions including Duty, Intellect, Adversity, and Sociality. The resulting 673 validated multiple-choice questions measure whether choices align with each character's specific profile. A sympathetic reader would care because this supplies a concrete way to evaluate psychological simulation beyond isolated task performance.

Core claim

The benchmark constructs 11 diverse human characters grounded in orthogonal Big Five personality traits, with each profile integrated with 1,000 structured autobiographical-style episodic memories distributed across theory-grounded developmental life stages. It subjects agents to 64 decision-making scenarios guided by the DIAMONDS taxonomy and evaluates whether agents consolidate their traits and memories to produce behavioral decisions consistent with their psychological profiles, yielding 673 multiple-choice questions after human validation.

What carries the argument

The DIAMONDS taxonomy of eight situational dimensions used to curate the 64 scenarios that probe whether decisions remain consistent with pre-assigned personality profiles and episodic memories.

If this is right

Agents can be evaluated for the ability to consolidate personality traits with autobiographical memories when facing varied situations.
The method supplies a scalable way to study value-consistent behavioral decision-making in LLM agents.
Human-validated multiple-choice questions allow repeatable measurement of personality consistency and emotional dimensions.
The benchmark treats emotional dimensions as equal in importance to task-oriented abilities such as planning and reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same construction could be reused to track whether larger models maintain profile consistency better than smaller ones.
Conflicts deliberately introduced between memories and traits could test whether agents resolve value tensions in human-like ways.
Results might inform whether current alignment techniques produce stable psychological profiles or only surface-level mimicry.

Load-bearing premise

That consistency between an agent's choices in the scenarios and its pre-assigned personality profile plus memories counts as evidence of coherent human-like psychology.

What would settle it

Observing that agents assigned different personality profiles produce statistically similar choice patterns across the 64 scenarios would show the benchmark fails to detect profile-specific psychology.
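
One way to run this check is sketched below: tabulate how often each profile picks each answer option, then compute a chi-square statistic for homogeneity across profiles. The function name, the toy counts, and the choice of a chi-square test are illustrative assumptions, not a procedure described in the paper.

```python
from typing import List, Tuple


def chi_square_homogeneity(counts: List[List[int]]) -> Tuple[float, int]:
    """Chi-square statistic and degrees of freedom for a profiles x options table."""
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(counts):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:  # skip cells in empty rows or columns
                stat += (observed - expected) ** 2 / expected
    dof = (len(counts) - 1) * (len(col_totals) - 1)
    return stat, dof


# Hypothetical counts: rows are two profiles, columns are options A-D.
identical = [[10, 20, 5, 5], [10, 20, 5, 5]]
distinct = [[30, 5, 3, 2], [2, 3, 5, 30]]
same_stat, dof = chi_square_homogeneity(identical)
diff_stat, _ = chi_square_homogeneity(distinct)
```

A statistic near zero across differently profiled agents would point toward the failure case described above. Turning the statistic into a p-value requires the chi-square distribution with `dof` degrees of freedom, which a statistics library can supply.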

Original abstract

While LLM agents have demonstrated remarkable task-oriented abilities such as planning, reasoning, and action, few works have treated them as complete human personalities where emotional dimensions hold equal importance. In this paper, we introduce a novel benchmark to systematically assess whether LLM agents can simulate coherent, human-like psychology. Specifically, our benchmark constructs 11 diverse human characters grounded in orthogonal Big Five personality traits, with each profile deeply integrated with 1,000 structured autobiographical-style episodic memories distributed across theory-grounded developmental life stages. To rigorously evaluate the psychological manifestations of LLMs, we designed a curated suite of 64 decision-making scenarios, guided by the DIAMONDS taxonomy, a psychological framework that characterizes situations along eight dimensions: Duty, Intellect, Adversity, Mating, pOsitivity, Negativity, Deception, and Sociality. By subjecting agents to varying scenarios, the benchmark evaluates whether they can consolidate their innate personality traits and autobiographical memories to make behavioral decisions that are consistent with their specific psychological profiles. After systematic human validation and filtering, we obtained a benchmark consisting of 673 multiple-choice questions (MCQs). We believe this benchmark provides a principled and scalable testbed for studying human-like emotions, personality consistency, and value-consistent behavioural decision-making in LLM-based agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

HEART-Bench supplies a new, detailed MCQ set for personality consistency but does not yet isolate that from simple instruction following.

The paper's concrete output is HEART-Bench: 11 characters defined by orthogonal Big Five profiles, each paired with 1,000 episodic memories spread across developmental stages, then 64 DIAMONDS scenarios turned into 673 human-validated multiple-choice questions.

They did the construction work carefully. The memory integration and situation taxonomy come from established psychology sources, and the human filtering step is described. That combination of elements is not already in the cited literature, so the benchmark itself is new.

The main limitation is the missing separation between prompt adherence and anything deeper. Every agent prompt contains the full personality description and memory list, so a model that simply repeats or applies that supplied text will score high on consistency. The abstract gives no ablations, no conflicting-memory conditions, and no comparison against human response distributions that would show the test measures more than retrieval. Without those, the claim that the benchmark captures human-like psychology rests on an assumption rather than evidence.
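
A profile-ablation control of the kind mentioned here could be scored roughly as follows. This is a minimal sketch: the condition names (`full`, `no_memories`, `no_profile`), the helper functions, and the toy answers are all hypothetical and do not come from the paper.

```python
from typing import Dict, List


def consistency_rate(answers: List[str], gold: List[str]) -> float:
    """Fraction of MCQs where the agent chose the profile-consistent option."""
    if not gold or len(answers) != len(gold):
        raise ValueError("answers and gold must be non-empty and equally long")
    return sum(a == g for a, g in zip(answers, gold)) / len(gold)


def ablation_gaps(runs: Dict[str, List[str]], gold: List[str],
                  baseline: str = "full") -> Dict[str, float]:
    """Drop in consistency for each ablated condition relative to the baseline."""
    base = consistency_rate(runs[baseline], gold)
    return {name: base - consistency_rate(answers, gold)
            for name, answers in runs.items() if name != baseline}


# Hypothetical answers to four questions under three prompt conditions.
gold = ["A", "C", "B", "D"]
runs = {
    "full": ["A", "C", "B", "D"],         # profile and memories in the prompt
    "no_memories": ["A", "C", "A", "D"],  # profile only
    "no_profile": ["B", "C", "A", "A"],   # neither profile nor memories
}
gaps = ablation_gaps(runs, gold)
```

A small gap would suggest the questions can be answered without the supplied profile. A large gap shows only that the model depends on the supplied text, so conflicting-memory conditions would still be needed to separate retrieval from anything deeper.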

The abstract also stops after benchmark creation. No LLM performance numbers appear, which leaves open whether the items are hard, easy, or even answerable in a way that distinguishes models.

This is useful for groups already running agent evaluations who want a psychology-flavored option to add to their suite. It is not yet ready to support strong claims about simulation of human psychology. A referee could usefully ask for the controls and at least one set of model results before publication.

Referee Report

2 major / 0 minor

Summary. The paper introduces HEART-Bench, a benchmark consisting of 11 characters grounded in orthogonal Big Five personality traits integrated with 1,000 episodic memories each, evaluated via 64 DIAMONDS scenarios to produce 673 MCQs after human validation. The central claim is that this provides a principled testbed for whether LLM agents can consolidate personality traits and memories into behaviorally consistent decisions that reflect human-like psychology.

Significance. If the benchmark can be shown to isolate psychological simulation from instruction-following, it would offer a scalable resource for studying personality consistency and value-based decisions in LLM agents. The current design, however, does not yet demonstrate this isolation, limiting the significance of the contribution to the field.

major comments (2)

[Abstract, evaluation design paragraph] The claim that consistency between agent choices on the 64 DIAMONDS scenarios and the supplied Big Five profiles plus 1,000 episodic memories constitutes evidence of coherent, human-like psychology lacks any described controls (e.g., profile ablation, conflicting memories, or comparison against human response distributions) that would distinguish this from simple prompt parroting.
[Abstract] The manuscript describes benchmark construction and human validation but supplies no empirical results on LLM performance, no error analysis, and no evidence that the MCQs actually elicit or measure the claimed psychological consistency.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the benchmark design and scope. We respond to each major comment below and indicate planned revisions to address the concerns while preserving the paper's focus on benchmark construction and validation.

Referee: [Abstract, evaluation design paragraph] The claim that consistency between agent choices on the 64 DIAMONDS scenarios and the supplied Big Five profiles plus 1,000 episodic memories constitutes evidence of coherent, human-like psychology lacks any described controls (e.g., profile ablation, conflicting memories, or comparison against human response distributions) that would distinguish this from simple prompt parroting.

Authors: We agree that the current text overstates the evidential value of consistency without controls. The benchmark design uses orthogonal traits and 1,000 memories per character to increase the difficulty of pure prompt parroting, but this does not substitute for explicit ablations. In revision we will (1) soften the abstract and evaluation paragraphs to describe the benchmark as enabling tests of psychological consistency rather than directly constituting evidence, (2) add a limitations subsection that explicitly discusses the lack of profile ablation, conflicting-memory, and human-distribution controls, and (3) outline planned follow-up experiments using those controls. revision: yes
Referee: [Abstract] The manuscript describes benchmark construction and human validation but supplies no empirical results on LLM performance, no error analysis, and no evidence that the MCQs actually elicit or measure the claimed psychological consistency.

Authors: The manuscript's primary contribution is the construction and human validation of HEART-Bench; no LLM evaluations or error analyses are reported because the work stops at releasing the testbed. We acknowledge that including baseline LLM results would help demonstrate the benchmark's utility. In the revised version we will add a short experimental section reporting performance of 2-3 representative LLMs on the 673 MCQs together with a basic error analysis, while clearly labeling these results as preliminary. revision: yes

Circularity Check

0 steps flagged

No significant circularity; externally constructed benchmark with no fitted predictions or self-referential derivations

The paper introduces HEART-Bench by constructing 11 character profiles from orthogonal Big Five traits, integrating 1,000 author-generated episodic memories, and curating 64 DIAMONDS-guided scenarios into 673 MCQs. Evaluation measures behavioral consistency with the supplied profiles and memories. This construction is explicit and external to any prior fitted parameters or equations in the paper. No derivation chain exists that reduces a claimed prediction to its inputs by construction (e.g., no parameter fitting followed by renamed prediction, no self-citation load-bearing a uniqueness theorem, no ansatz smuggled via citation). The central claim concerns the benchmark's utility as a testbed rather than deriving a result equivalent to its inputs. Self-citations, if present, are not load-bearing for the benchmark design itself. This matches the default expectation of no circularity for a benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the validity of two established psychological frameworks and the untested assumption that MCQ consistency equals human-like psychology. No free parameters are introduced. The benchmark itself is the primary invented entity but carries no independent evidence beyond the paper's construction.

axioms (2)

domain assumption Big Five personality traits are orthogonal and suitable for constructing distinct human characters
Invoked to ground the 11 profiles in the abstract
domain assumption DIAMONDS taxonomy accurately characterizes situations for eliciting psychological responses
Used to guide the 64 decision-making scenarios

invented entities (1)

HEART-Bench no independent evidence
purpose: Testbed for LLM psychological consistency
The benchmark is the paper's main contribution

pith-pipeline@v0.9.1-grok · 5781 in / 1419 out tokens · 34721 ms · 2026-06-29T07:16:35.138416+00:00 · methodology

Reference graph

Works this paper leans on

107 extracted references · 15 canonical work pages · 8 internal anchors

[1]

Introducing Claude Opus 4.7. https://www.anthropic.com/news/claude-opus-4-7, 2026

Anthropic. Introducing Claude Opus 4.7. https://www.anthropic.com/news/claude-opus-4-7, 2026. Accessed: 2026-05-06

2026
[2]

Introducing GPT-5.5. https://openai.com/index/introducing-gpt-5-5/, 2026

OpenAI. Introducing GPT-5.5. https://openai.com/index/introducing-gpt-5-5/, 2026. Accessed: 2026-05-06

2026
[3]

Gemini 3.1 Pro: A smarter model for your most complex tasks. https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/, 2026

Google DeepMind. Gemini 3.1 Pro: A smarter model for your most complex tasks. https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/, 2026. Accessed: 2026-05-06

2026
[4]

From persona to personalization: A survey on role-playing language agents

Jiangjie Chen, Xintao Wang, Rui Xu, Siyu Yuan, Yikai Zhang, Wei Shi, Jian Xie, Shuang Li, Ruihan Yang, Tinghui Zhu, Aili Chen, Nianqi Li, Lida Chen, Caiyu Hu, Siye Wu, Scott Ren, Ziquan Fu, and Yanghua Xiao. From persona to personalization: A survey on role-playing language agents. arXiv preprint arXiv:2404.18231, 2024

work page arXiv 2024
[5]

Two tales of persona in llms: A survey of role-playing and personalization

Yu-Min Tseng, Yu-Chao Huang, Teng-Yun Hsiao, Wei-Lin Chen, Chao-Wei Huang, Yu Meng, and Yun-Nung Chen. Two tales of persona in llms: A survey of role-playing and personalization. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 16612–16631, 2024

2024
[6]

Loneliness and suicide mitigation for students using gpt3-enabled chatbots.npj mental health research, 3(1):4, 2024

Bethanie Maples, Merve Cerit, Aditya Vishwanath, and Roy Pea. Loneliness and suicide mitigation for students using gpt3-enabled chatbots.npj mental health research, 3(1):4, 2024

2024
[7]

Investigating affective use and emotional well-being on chatgpt.arXiv preprint arXiv:2504.03888, 2025

Jason Phang, Michael Lampe, Lama Ahmad, Sandhini Agarwal, Cathy Mengying Fang, Auren R Liu, Valdemar Danry, Eunhae Lee, Samantha WT Chan, Pat Pataranutaporn, et al. Investigating affective use and emotional well-being on chatgpt.arXiv preprint arXiv:2504.03888, 2025

work page arXiv 2025
[8]

Ai-native memory 2.0: Second me.arXiv preprint arXiv:2503.08102, 2025

Jiale Wei, Xiang Ying, Tao Gao, Fangyi Bao, Felix Tao, and Jingbo Shang. Ai-native memory 2.0: Second me.arXiv preprint arXiv:2503.08102, 2025

work page arXiv 2025
[9]

Creating text-based AI clones of myself: Exploring perceptions, development strategies, and challenges. International Journal of Human-Computer Studies, 208:103692, 2026

Donggun Lee, Suyoun Lee, Hyunseung Lim, and Hwajung Hong. Creating text-based AI clones of myself: Exploring perceptions, development strategies, and challenges. International Journal of Human-Computer Studies, 208:103692, 2026

2026
[10]

Affective intelligence and political judgment. University of Chicago Press, 2000

George E Marcus, W Russell Neuman, and Michael MacKuen. Affective intelligence and political judgment. University of Chicago Press, 2000

2000
[11]

Emotional intelligence.Imagination, cognition and personality, 9(3):185–211, 1990

Peter Salovey and John D Mayer. Emotional intelligence.Imagination, cognition and personality, 9(3):185–211, 1990

1990
[12]

Seeing one’s self: Locating narrative memory in a framework of personality. Journal of Personality, 63(3):429–457, 1995

Jefferson A Singer. Seeing one’s self: Locating narrative memory in a framework of personality. Journal of Personality, 63(3):429–457, 1995

1995
[13]

Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems, 33:9459–9474, 2020

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems, 33:9459–9474, 2020

2020
[14]

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready ai agents with scalable long-term memory.arXiv preprint arXiv:2504.19413, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

Persona-db: Efficient large language model personalization for response prediction with collaborative data refinement

Chenkai Sun, Ke Yang, Revanth Gangi Reddy, Yi Fung, Hou Pong Chan, Kevin Small, ChengXiang Zhai, and Heng Ji. Persona-db: Efficient large language model personalization for response prediction with collaborative data refinement. In Proceedings of the 31st International Conference on Computational Linguistics, pages 281–296, 2025

2025
[16]

Charactereval: A Chinese benchmark for role-playing conversational agent evaluation

Quan Tu, Shilong Fan, Zihang Tian, Tianhao Shen, Shuo Shang, Xin Gao, and Rui Yan. Charactereval: A Chinese benchmark for role-playing conversational agent evaluation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11836–11850, 2024

2024
[17]

Scaling Synthetic Data Creation with 1,000,000,000 Personas

Tao Ge, Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, and Dong Yu. Scaling synthetic data creation with 1,000,000,000 personas.arXiv preprint arXiv:2406.20094, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

Evaluating very long-term conversational memory of LLM agents

Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of LLM agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13851–13870, 2024

2024
[19]

Clonemem: Benchmarking long-term memory for ai clones. arXiv preprint arXiv:2601.07023, 2026

Sen Hu, Zhiyu Zhang, Yuxiang Wei, Xueran Han, Zhenheng Tang, Huacan Wang, and Ronghao Chen. Clonemem: Benchmarking long-term memory for ai clones. arXiv preprint arXiv:2601.07023, 2026

work page arXiv 2026
[20]

Personalens: A benchmark for personalization evaluation in conversational ai assistants

Zheng Zhao, Clara Vania, Subhradeep Kayal, Naila Khan, Shay B Cohen, and Emine Yilmaz. Personalens: A benchmark for personalization evaluation in conversational ai assistants. InFindings of the Association for Computational Linguistics: ACL 2025, pages 18023–18055, 2025

2025
[21]

Know me, respond to me: Benchmarking LLMs for dynamic user profiling and personalized responses at scale

Bowen Jiang, Zhuoqun Hao, Young-Min Cho, Bryan Li, Yuan Yuan, Sihao Chen, Lyle Ungar, Camillo J Taylor, and Dan Roth. Know me, respond to me: Benchmarking llms for dynamic user profiling and personalized responses at scale.arXiv preprint arXiv:2504.14225, 2025

work page arXiv 2025
[22]

An introduction to the five-factor model and its applications.Journal of personality, 60(2):175–215, 1992

Robert R McCrae and Oliver P John. An introduction to the five-factor model and its applications.Journal of personality, 60(2):175–215, 1992

1992
[23]

Personality trait structure as a human universal.American Psychologist, 52(5):509–516, 1997

Robert R McCrae and Paul T Costa. Personality trait structure as a human universal.American Psychologist, 52(5):509–516, 1997

1997
[24]

The power of personality: The comparative validity of personality traits, socioeconomic status, and cognitive ability for predicting important life outcomes

Brent W Roberts, Nathan R Kuncel, Rebecca Shiner, Avshalom Caspi, and Lewis R Goldberg. The power of personality: The comparative validity of personality traits, socioeconomic status, and cognitive ability for predicting important life outcomes. Perspectives on Psychological Science, 2(4):313–345, 2007

2007
[25]

Toward a structure- and process-integrated view of personality: Traits as density distributions of states

William Fleeson. Toward a structure- and process-integrated view of personality: Traits as density distributions of states. Journal of Personality and Social Psychology, 80(6):1011–1027, 2001

2001
[26]

The construction of autobiographical memories in the self-memory system.Psychological review, 107(2):261, 2000

Martin A Conway and Christopher W Pleydell-Pearce. The construction of autobiographical memories in the self-memory system.Psychological review, 107(2):261, 2000

2000
[27]

Schema Therapy: A Practitioner’s Guide

Jeffrey E. Young, Janet S. Klosko, and Marjorie E. Weishaar. Schema Therapy: A Practitioner’s Guide. Guilford Press, 2006

2006
[28]

Character-LLM: A trainable agent for role-playing

Yunfan Shao, Linyang Li, Junqi Dai, and Xipeng Qiu. Character-LLM: A trainable agent for role-playing. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13153–13187, 2023

2023
[29]

PersonaLLM: Investigating the ability of large language models to express personality traits

Hang Jiang, Xiajie Zhang, Xubo Cao, Cynthia Breazeal, Deb Roy, and Jad Kabbara. PersonaLLM: Investigating the ability of large language models to express personality traits. InFindings of the association for computational linguistics: NAACL 2024, pages 3605–3627, 2024

2024
[30]

PersonaAgent: Bridging Memory and Action for Personalized LLM Agents

Weizhi Zhang, Xinyang Zhang, Chenwei Zhang, Liangwei Yang, Jingbo Shang, Zhepei Wei, Henry Peng Zou, Zijie Huang, Zhengyang Wang, Yifan Gao, Xiaoman Pan, Lian Xiong, Jingguo Liu, Philip S. Yu, and Xian Li. PersonaAgent: When large language model agents meet personalization at test time.arXiv preprint arXiv:2506.06254, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[31]

RGMem: Renormalization Group-inspired Memory Evolution for Language Agents

Ao Tian, Yunfeng Lu, Xinxin Fan, Changhao Wang, Lanzhi Zhou, Yeyao Zhang, and Yanfang Liu. Rgmem: Renormalization group-based memory evolution for language agent user profile.arXiv preprint arXiv:2510.16392, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

Memgpt: towards llms as operating systems

Charles Packer, Vivian Fang, Shishir_G Patil, Kevin Lin, Sarah Wooders, and Joseph_E Gonzalez. Memgpt: towards llms as operating systems. 2023

2023
[33]

Hipporag: Neurobiologically inspired long-term memory for large language models.Advances in neural information processing systems, 37:59532–59569, 2024

Bernal J Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. Hipporag: Neurobiologically inspired long-term memory for large language models.Advances in neural information processing systems, 37:59532–59569, 2024

2024
[34]

A-MEM: Agentic Memory for LLM Agents

Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for LLM agents. arXiv preprint arXiv:2502.12110, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

Mood and memory.American psychologist, 36(2):129, 1981

Gordon H Bower. Mood and memory.American psychologist, 36(2):129, 1981

1981
[36]

InProceedings of the Fifth Workshop on Natural Language Processing and Computational Social Science (NLP+ CSS), pages 218–227

Greg Serapio-García, Mustafa Safdari, Clément Crepy, Luning Sun, Stephen Fitz, Peter Romero, Marwa Abdulhai, Aleksandra Faust, and Maja Matarić. Personality traits in large language models.arXiv preprint arXiv:2307.00184, 2023

work page arXiv 2023
[37]

The psychology of life stories. Review of general psychology, 5(2):100–122, 2001

Dan P McAdams. The psychology of life stories. Review of general psychology, 5(2):100–122, 2001

2001
[38]

Developing a life story: Constructing relations between self and experience in autobiographical narratives.Human development, 50(2-3):85–110, 2007

Monisha Pasupathi, Emma Mansour, and Jed R Brubaker. Developing a life story: Constructing relations between self and experience in autobiographical narratives.Human development, 50(2-3):85–110, 2007

2007
[39]

LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. Longmemeval: Benchmarking chat assistants on long-term interactive memory. arXiv preprint arXiv:2410.10813, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[40]

Realmem: Benchmarking LLMs in real-world memory-driven interaction.arXiv preprint arXiv:2601.06966, 2026

Haonan Bian, Zhiyuan Yao, Sen Hu, Zishan Xu, Shaolei Zhang, Yifu Guo, Ziliang Yang, Xueran Han, Huacan Wang, and Ronghao Chen. Realmem: Benchmarking LLMs in real-world memory-driven interaction.arXiv preprint arXiv:2601.06966, 2026

work page arXiv 2026
[41]

KnowMe-Bench: Benchmarking Person Understanding for Lifelong Digital Companions

Tingyu Wu, Zhisheng Chen, Ziyan Weng, Shuhe Wang, Chenglong Li, Shuo Zhang, Sen Hu, Silin Wu, Qizhen Lan, Huacan Wang, et al. Knowme-bench: Benchmarking person understanding for lifelong digital companions. arXiv preprint arXiv:2601.04745, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[42]

The situational eight diamonds: A taxonomy of major dimensions of situation characteristics.Journal of Personality and Social Psychology, 107(4):677–718, 2014

John F Rauthmann, David Gallardo-Pujol, Emmanuel M Guillaume, Emma Todd, Christopher S Nave, Robert A Sherman, Matthias Ziegler, and David C Funder. The situational eight diamonds: A taxonomy of major dimensions of situation characteristics.Journal of Personality and Social Psychology, 107(4):677–718, 2014

2014
[43]

The basic-systems model of autobiographical memory. Perspectives on Psychological Science, 1(1):3–11, 2006

David C. Rubin. The basic-systems model of autobiographical memory. Perspectives on Psychological Science, 1(1):3–11, 2006

2006
[44]

Childhood and Society

Erik H. Erikson. Childhood and Society. W. W. Norton & Company, New York, 2nd edition, 1963

1963
[45]

The Seasons of a Man’s Life

Daniel J. Levinson, Charlotte N. Darrow, Edward B. Klein, Maria H. Levinson, and Braxton McKee. The Seasons of a Man’s Life. Knopf, New York, 1978

1978
[46]

Universals in the content and structure of values: Theoretical advances and empirical tests in 20 countries

Shalom H Schwartz. Universals in the content and structure of values: Theoretical advances and empirical tests in 20 countries. InAdvances in experimental social psychology, volume 25, pages 1–65. Elsevier, 1992

1992
[47]

A conception of adult development

Daniel J. Levinson. A conception of adult development. American Psychologist, 41(1):3–13, 1986

1986
[48]

Mitigating gender bias in natural language processing: Literature review

Tony Sun, Andrew Gaut, Shirlyn Tang, Yuxin Huang, Mai ElSherief, Jieyu Zhao, Diba Mirza, Elizabeth Belding, Kai-Wei Chang, and William Yang Wang. Mitigating gender bias in natural language processing: Literature review. InProceedings of the 57th annual meeting of the association for computational linguistics, pages 1630–1640, 2019

2019
[49]

Introducing Claude Opus 4.6. https://www.anthropic.com/news/claude-opus-4-6, 2026

Anthropic. Introducing Claude Opus 4.6. https://www.anthropic.com/news/claude-opus-4-6, 2026. Accessed: 2026-05-06

2026
[50]

Deepseek-v3.2: Pushing the frontier of open large language models, 2025

DeepSeek-AI. Deepseek-v3.2: Pushing the frontier of open large language models, 2025

2025
[51]

Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

2026
[52]

Qwen3.5: Towards native multimodal agents, February 2026

Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026

2026
[53]

Introducing GPT-5.4. https://openai.com/index/introducing-gpt-5-4/, 2026

OpenAI. Introducing GPT-5.4. https://openai.com/index/introducing-gpt-5-4/, 2026. Accessed: 2026-05-06

2026
[54]

Introducing Claude Haiku 4.5. https://www.anthropic.com/news/claude-haiku-4-5, 2025

Anthropic. Introducing Claude Haiku 4.5. https://www.anthropic.com/news/claude-haiku-4-5, 2025. Accessed: 2026-05-06

2025
[55]

Introducing Claude Sonnet 4.6. https://www.anthropic.com/news/claude-sonnet-4-6, 2026

Anthropic. Introducing Claude Sonnet 4.6. https://www.anthropic.com/news/claude-sonnet-4-6, 2026. Accessed: 2026-05-06

2026
[56]

Gemini 3 Flash model card. https://blog.google/products-and-platforms/products/gemini/gemini-3-flash/, 2025

Google DeepMind. Gemini 3 Flash model card. https://blog.google/products-and-platforms/products/gemini/gemini-3-flash/, 2025. Accessed: 2026-05-06

2025
[57]

Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[58]

Character constraints: full Big Five vector, occupation, self-value logic, and all eight core behavioural patterns from Appendix B.1
[59]

Developmental context: target age window, life-stage label, and the core developmental task from Table 8 (Appendix B.6)
[60]

failed a job interview

Rubin coverage: per-dimension guidance requiring organic integration of all five dimensions without explicit section labels. Unlike semantic memory, which stores abstract, decontextualized facts (e.g., “failed a job interview”), each content_full narrative must encode information that collectively reconstructs the subjective texture of the experience and is irreducible...
[61]

years later,

Perspective Drift (Omniscient Narrator). Violating the first-person narrative constraint. Detected when the narrator uses post-hoc knowledge or future-tense markers (e.g., “years later,” “I didn’t know then”). This violates the strict “in-the-moment” constraint. 2. Perspective Drift (Adult Evaluation). Specifically flags child-age memories (≤12) that use adu...
[62]

context exceeding 60 characters or content_summary exceeding 100 characters, indicating narrative leakage into metadata fields

Field-Length Inversion. context exceeding 60 characters or content_summary exceeding 100 characters, indicating narrative leakage into metadata fields
[63]

This is detected using a cosine similarity threshold of>0.85over TF-IDF vector representations of the narratives

Semantic Duplication.Redundant life episodes within the same developmental age window. This is detected using a cosine similarity threshold of>0.85over TF-IDF vector representations of the narratives
[64]

Length Violation.content_fullbelow 2,000 characters (insufficient Rubin coverage) or above 4,500 characters (excessive verbosity)
[65]

abandonment schema

Schema-Vocabulary Inflation.Use of high-register psychoanalytic terms (e.g., “abandonment schema”, “dysregula- tion”) in low-intensity episodes (<0.4), indicating over-psychologisation. Repair procedure.Flagged memories enter a targeted repair loop: only the defective field(s) are regenerated using an issue-specificinstructionappendedtotheoriginalgenerati...
[66]

Any flagged scenario is revised by consensus before entering the second pass

Logical and psychological screening.Two doctoral-level psychology researchers jointly check each scenario for logical consistency and psychological plausibility. Any flagged scenario is revised by consensus before entering the second pass
[67]

SCN_ENTERING_MIDLIFE_5

Independent rating and conflict resolution.Both reviewers independently rate each scenario on the eight DIA- MONDS dimensions (1–7 Likert) and identify the scenario’s dominant Schwartz value-conflict pair. Disagreements above 1 point trigger joint re-assessment. We do not require a uniform DIAMONDS rating distribution; the goal is that each dimension be e...
[68]

Big Five scores: Provide exact scores for O, C, E, A, N (0.0-1.0 scale)
[69]

One dimension must be extreme (0.95 for high or 0.10 for low), others moderate (0.30-0.70)
[70]

Self-value logic: One sentence describing the character's core cognitive operating principle
[71]

Core behavioral patterns: Exactly 8 specific, observable patterns that manifest the dominant trait
[72]

char_key

Occupation: Must be ecologically consistent with the dominant trait ## Output Format (JSON) { "char_key": "CHAR_XX", "name": "Chinese name", "occupation": "specific job title", "big_five": {"O": 0.X, "C": 0.X, "E": 0.X, "A": 0.X, "N": 0.X}, "big_five_str": "O=0.X, C=0.X, E=0.X, A=0.X, N=0.X", "description": "2-3 sentence character summary", "self_value_lo...

2006
[73]

Environmental sensory: visual/auditory/tactile/olfactory details
[74]

Dialogue reconstruction: at least 1-2 segments of real dialogue (in quotes)
[75]

Inner monologue: first-person immediate thoughts and emotions
[76]

Somatic response: heartbeat/sweating/muscle tension and other bodily sensations
[77]

I" throughout; prohibit third-person pronouns for protagonist** - **Prohibit character's own name in memory**; always use

Aftermath: immediate impact after the event (limited to 24 hours-1 week post-event) ## Key Constraint: First-Person & Anonymity Principle - **Must use first-person "I" throughout; prohibit third-person pronouns for protagonist** - **Prohibit character's own name in memory**; always use "I" instead - When others address protagonist in dialogue, avoid chara...

2000
[78]

**id**: SCN_{STAGE}_{N} in upper case, where STAGE is one of SCHOOL_AGE / ADOLESCENCE / EARLY_ADULT_TRANSITION / ENTERING_ADULT_WORLD / AGE_30_TRANSITION / SETTLING_DOWN / MIDLIFE_TRANSITION / ENTERING_MIDLIFE
[79]

**stage**: lowercase form of the above
[80]

**diamonds_dimension**: the dominant DIAMONDS dimension (Duty, Adversity, Positivity, ...) as given in the sketch


[1] Anthropic. Introducing Claude Opus 4.7. https://www.anthropic.com/news/claude-opus-4-7, 2026. Accessed: 2026-05-06.

[2] OpenAI. Introducing GPT-5.5. https://openai.com/index/introducing-gpt-5-5/, 2026. Accessed: 2026-05-06.

[3] Google DeepMind. Gemini 3.1 Pro: A smarter model for your most complex tasks. https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/, 2026. Accessed: 2026-05-06.

[4] Jiangjie Chen, Xintao Wang, Rui Xu, Siyu Yuan, Yikai Zhang, Wei Shi, Jian Xie, Shuang Li, Ruihan Yang, Tinghui Zhu, Aili Chen, Nianqi Li, Lida Chen, Caiyu Hu, Siye Wu, Scott Ren, Ziquan Fu, and Yanghua Xiao. From persona to personalization: A survey on role-playing language agents. arXiv preprint arXiv:2404.18231, 2024.

[5] Yu-Min Tseng, Yu-Chao Huang, Teng-Yun Hsiao, Wei-Lin Chen, Chao-Wei Huang, Yu Meng, and Yun-Nung Chen. Two tales of persona in llms: A survey of role-playing and personalization. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 16612–16631, 2024.

[6] Bethanie Maples, Merve Cerit, Aditya Vishwanath, and Roy Pea. Loneliness and suicide mitigation for students using gpt3-enabled chatbots. npj mental health research, 3(1):4, 2024.

[7] Jason Phang, Michael Lampe, Lama Ahmad, Sandhini Agarwal, Cathy Mengying Fang, Auren R Liu, Valdemar Danry, Eunhae Lee, Samantha WT Chan, Pat Pataranutaporn, et al. Investigating affective use and emotional well-being on chatgpt. arXiv preprint arXiv:2504.03888, 2025.

[8] Jiale Wei, Xiang Ying, Tao Gao, Fangyi Bao, Felix Tao, and Jingbo Shang. Ai-native memory 2.0: Second me. arXiv preprint arXiv:2503.08102, 2025.

[9] Donggun Lee, Suyoun Lee, Hyunseung Lim, and Hwajung Hong. Creating text-based ai clones of myself: Exploring perceptions, development strategies, and challenges. International Journal of Human-Computer Studies, 208:103692, 2026.

[10] George E Marcus, W Russell Neuman, and Michael MacKuen. Affective intelligence and political judgment. University of Chicago Press, 2000.

[11] Peter Salovey and John D Mayer. Emotional intelligence. Imagination, cognition and personality, 9(3):185–211, 1990.

[12] Jefferson A Singer. Seeing one’s self: Locating narrative memory in a framework of personality. Journal of Personality, 63(3):429–457, 1995.

[13] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems, 33:9459–9474, 2020.

[14] Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready ai agents with scalable long-term memory. arXiv preprint arXiv:2504.19413, 2025.

[15] Chenkai Sun, Ke Yang, Revanth Gangi Reddy, Yi Fung, Hou Pong Chan, Kevin Small, ChengXiang Zhai, and Heng Ji. Persona-db: Efficient large language model personalization for response prediction with collaborative data refinement. In Proceedings of the 31st International Conference on Computational Linguistics, pages 281–296, 2025.

[16] Quan Tu, Shilong Fan, Zihang Tian, Tianhao Shen, Shuo Shang, Xin Gao, and Rui Yan. Charactereval: A chinese benchmark for role-playing conversational agent evaluation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11836–11850, 2024.

[17] Tao Ge, Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, and Dong Yu. Scaling synthetic data creation with 1,000,000,000 personas. arXiv preprint arXiv:2406.20094, 2024.

[18] Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of llm agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13851–13870, 2024.

[19] Sen Hu, Zhiyu Zhang, Yuxiang Wei, Xueran Han, Zhenheng Tang, Huacan Wang, and Ronghao Chen. Clonemem: Benchmarking long-term memory for ai clones. arXiv preprint arXiv:2601.07023, 2026.

[20] Zheng Zhao, Clara Vania, Subhradeep Kayal, Naila Khan, Shay B Cohen, and Emine Yilmaz. Personalens: A benchmark for personalization evaluation in conversational ai assistants. In Findings of the Association for Computational Linguistics: ACL 2025, pages 18023–18055, 2025.

[21] Bowen Jiang, Zhuoqun Hao, Young-Min Cho, Bryan Li, Yuan Yuan, Sihao Chen, Lyle Ungar, Camillo J Taylor, and Dan Roth. Know me, respond to me: Benchmarking llms for dynamic user profiling and personalized responses at scale. arXiv preprint arXiv:2504.14225, 2025.

[22] Robert R McCrae and Oliver P John. An introduction to the five-factor model and its applications. Journal of personality, 60(2):175–215, 1992.

[23] Robert R McCrae and Paul T Costa. Personality trait structure as a human universal. American Psychologist, 52(5):509–516, 1997.

[24] Brent W Roberts, Nathan R Kuncel, Rebecca Shiner, Avshalom Caspi, and Lewis R Goldberg. The power of personality: The comparative validity of personality traits, socioeconomic status, and cognitive ability for predicting important life outcomes. Perspectives on Psychological Science, 2(4):313–345, 2007.

[25] William Fleeson. Toward a structure- and process-integrated view of personality: Traits as density distributions of states. Journal of Personality and Social Psychology, 80(6):1011–1027, 2001.

[26] Martin A Conway and Christopher W Pleydell-Pearce. The construction of autobiographical memories in the self-memory system. Psychological review, 107(2):261, 2000.

[27] Jeffrey E. Young, Janet S. Klosko, and Marjorie E. Weishaar. Schema Therapy: A Practitioner’s Guide. Guilford Press, 2006.

[28] Yunfan Shao, Linyang Li, Junqi Dai, and Xipeng Qiu. Character-LLM: A trainable agent for role-playing. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13153–13187, 2023.

[29] Hang Jiang, Xiajie Zhang, Xubo Cao, Cynthia Breazeal, Deb Roy, and Jad Kabbara. PersonaLLM: Investigating the ability of large language models to express personality traits. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 3605–3627, 2024.

[30] Weizhi Zhang, Xinyang Zhang, Chenwei Zhang, Liangwei Yang, Jingbo Shang, Zhepei Wei, Henry Peng Zou, Zijie Huang, Zhengyang Wang, Yifan Gao, Xiaoman Pan, Lian Xiong, Jingguo Liu, Philip S. Yu, and Xian Li. PersonaAgent: When large language model agents meet personalization at test time. arXiv preprint arXiv:2506.06254, 2025.

[31] Ao Tian, Yunfeng Lu, Xinxin Fan, Changhao Wang, Lanzhi Zhou, Yeyao Zhang, and Yanfang Liu. Rgmem: Renormalization group-based memory evolution for language agent user profile. arXiv preprint arXiv:2510.16392, 2025.

[32] Charles Packer, Vivian Fang, Shishir G Patil, Kevin Lin, Sarah Wooders, and Joseph E Gonzalez. Memgpt: towards llms as operating systems. 2023.

[33] Bernal J Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. Hipporag: Neurobiologically inspired long-term memory for large language models. Advances in neural information processing systems, 37:59532–59569, 2024.

[34] Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for LLM agents. arXiv preprint arXiv:2502.12110, 2025.

[35] Gordon H Bower. Mood and memory. American psychologist, 36(2):129, 1981.

[36] Greg Serapio-García, Mustafa Safdari, Clément Crepy, Luning Sun, Stephen Fitz, Peter Romero, Marwa Abdulhai, Aleksandra Faust, and Maja Matarić. Personality traits in large language models. arXiv preprint arXiv:2307.00184, 2023.

[37] Dan P McAdams. The psychology of life stories. Review of general psychology, 5(2):100–122, 2001.

[38] Monisha Pasupathi, Emma Mansour, and Jed R Brubaker. Developing a life story: Constructing relations between self and experience in autobiographical narratives. Human development, 50(2-3):85–110, 2007.

[39] Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. Longmemeval: Benchmarking chat assistants on long-term interactive memory. arXiv preprint arXiv:2410.10813, 2024.

[40] Haonan Bian, Zhiyuan Yao, Sen Hu, Zishan Xu, Shaolei Zhang, Yifu Guo, Ziliang Yang, Xueran Han, Huacan Wang, and Ronghao Chen. Realmem: Benchmarking LLMs in real-world memory-driven interaction. arXiv preprint arXiv:2601.06966, 2026.

[41] Tingyu Wu, Zhisheng Chen, Ziyan Weng, Shuhe Wang, Chenglong Li, Shuo Zhang, Sen Hu, Silin Wu, Qizhen Lan, Huacan Wang, et al. Knowme-bench: Benchmarking person understanding for lifelong digital companions. arXiv preprint arXiv:2601.04745, 2026.

[42] John F Rauthmann, David Gallardo-Pujol, Emmanuel M Guillaume, Emma Todd, Christopher S Nave, Robert A Sherman, Matthias Ziegler, and David C Funder. The situational eight diamonds: A taxonomy of major dimensions of situation characteristics. Journal of Personality and Social Psychology, 107(4):677–718, 2014.

[43] David C. Rubin. The basic-systems model of autobiographical memory. Perspectives on Psychological Science, 1(1):3–11, 2006.

[44] Erik H. Erikson. Childhood and Society. W. W. Norton & Company, New York, 2nd edition, 1963.

[45] Daniel J. Levinson, Charlotte N. Darrow, Edward B. Klein, Maria H. Levinson, and Braxton McKee. The Seasons of a Man’s Life. Knopf, New York, 1978.

[46] Shalom H Schwartz. Universals in the content and structure of values: Theoretical advances and empirical tests in 20 countries. In Advances in experimental social psychology, volume 25, pages 1–65. Elsevier, 1992.

[47] Daniel J. Levinson. A conception of adult development. American Psychologist, 41(1):3–13, 1986.

[48] Tony Sun, Andrew Gaut, Shirlyn Tang, Yuxin Huang, Mai ElSherief, Jieyu Zhao, Diba Mirza, Elizabeth Belding, Kai-Wei Chang, and William Yang Wang. Mitigating gender bias in natural language processing: Literature review. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1630–1640, 2019.

[49] Anthropic. Introducing Claude Opus 4.6. https://www.anthropic.com/news/claude-opus-4-6, 2026. Accessed: 2026-05-06.

[50] DeepSeek-AI. Deepseek-v3.2: Pushing the frontier of open large language models, 2025.

[51] DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026.

[52] Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026.

[53] OpenAI. Introducing GPT-5.4. https://openai.com/index/introducing-gpt-5-4/, 2026. Accessed: 2026-05-06.

[54] Anthropic. Introducing Claude Haiku 4.5. https://www.anthropic.com/news/claude-haiku-4-5, 2025. Accessed: 2026-05-06.

[55] Anthropic. Introducing Claude Sonnet 4.6. https://www.anthropic.com/news/claude-sonnet-4-6, 2026. Accessed: 2026-05-06.

[56] Google DeepMind. Gemini 3 Flash model card. https://blog.google/products-and-platforms/products/gemini/gemini-3-flash/, 2025. Accessed: 2026-05-06.

[57] Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176, 2025.

Appendices

A Benchmark Comparison Dimensions

[58] Character constraints: full Big Five vector, occupation, self-value logic, and all eight core behavioural patterns from Appendix B.1

[59] Developmental context: target age window, life-stage label, and the core developmental task from Table 8 (Appendix B.6)

[60] Rubin coverage: per-dimension guidance requiring organic integration of all five dimensions without explicit section labels. Unlike semantic memory, which stores abstract, decontextualized facts (e.g., “failed a job interview”), each content_full narrative must encode information that collectively reconstructs the subjective texture of the experience and is irreducible...

[61] Perspective Drift (Omniscient Narrator). Violating the first-person narrative constraint. Detected when the narrator uses post-hoc knowledge or future-tense markers (e.g., “years later,” “I didn’t know then”). This violates the strict “in-the-moment” constraint. 2. Perspective Drift (Adult Evaluation). Specifically flags child-age memories (≤12) that use adu...

[62] Field-Length Inversion. context exceeding 60 characters or content_summary exceeding 100 characters, indicating narrative leakage into metadata fields

[63] Semantic Duplication. Redundant life episodes within the same developmental age window. This is detected using a cosine similarity threshold of > 0.85 over TF-IDF vector representations of the narratives

[64] Length Violation. content_full below 2,000 characters (insufficient Rubin coverage) or above 4,500 characters (excessive verbosity)

[65] Schema-Vocabulary Inflation. Use of high-register psychoanalytic terms (e.g., “abandonment schema”, “dysregulation”) in low-intensity episodes (<0.4), indicating over-psychologisation. Repair procedure. Flagged memories enter a targeted repair loop: only the defective field(s) are regenerated using an issue-specific instruction appended to the original generati...
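
Three of the checks listed above (field-length inversion, length violation, and semantic duplication) are mechanical enough to sketch in code. The Python below is a minimal illustration under stated assumptions, not the authors' pipeline: the dict layout of a memory record, the function names, and the hand-rolled TF-IDF (word tokenisation with smoothed IDF) are all hypothetical. Only the field names context, content_summary, content_full and the thresholds 60, 100, 2,000, 4,500, and 0.85 come from the text.

```python
import math
import re
from collections import Counter


def check_memory(mem):
    """Return issue labels for one memory record (assumed to be a dict)."""
    issues = []
    # Field-Length Inversion: metadata fields that are too long.
    if len(mem["context"]) > 60 or len(mem["content_summary"]) > 100:
        issues.append("field_length_inversion")
    # Length Violation: narrative shorter than 2,000 or longer than 4,500 chars.
    if not 2000 <= len(mem["content_full"]) <= 4500:
        issues.append("length_violation")
    return issues


def tfidf_vectors(texts):
    """Build sparse TF-IDF vectors; the tokeniser and IDF form are assumptions."""
    docs = [Counter(re.findall(r"\w+", t.lower())) for t in texts]
    n = len(docs)
    df = Counter(term for doc in docs for term in doc)
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def find_duplicates(narratives, threshold=0.85):
    """Index pairs whose similarity exceeds the threshold.

    `narratives` is assumed to hold the texts of a single age window.
    """
    vecs = tfidf_vectors(narratives)
    return [
        (i, j)
        for i in range(len(vecs))
        for j in range(i + 1, len(vecs))
        if cosine(vecs[i], vecs[j]) > threshold
    ]
```

The `\w+` tokeniser only makes sense for whitespace-delimited text; narratives written in Chinese would need a word segmenter or character n-grams before the same similarity threshold is meaningful.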

[66] Logical and psychological screening. Two doctoral-level psychology researchers jointly check each scenario for logical consistency and psychological plausibility. Any flagged scenario is revised by consensus before entering the second pass

[67] Independent rating and conflict resolution. Both reviewers independently rate each scenario on the eight DIAMONDS dimensions (1–7 Likert) and identify the scenario’s dominant Schwartz value-conflict pair. Disagreements above 1 point trigger joint re-assessment. We do not require a uniform DIAMONDS rating distribution; the goal is that each dimension be e...

[68] Big Five scores: Provide exact scores for O, C, E, A, N (0.0-1.0 scale)

[69] One dimension must be extreme (0.95 for high or 0.10 for low), others moderate (0.30-0.70)

[70] Self-value logic: One sentence describing the character's core cognitive operating principle

[71] Core behavioral patterns: Exactly 8 specific, observable patterns that manifest the dominant trait

[72] Occupation: Must be ecologically consistent with the dominant trait

## Output Format (JSON)
{ "char_key": "CHAR_XX", "name": "Chinese name", "occupation": "specific job title", "big_five": {"O": 0.X, "C": 0.X, "E": 0.X, "A": 0.X, "N": 0.X}, "big_five_str": "O=0.X, C=0.X, E=0.X, A=0.X, N=0.X", "description": "2-3 sentence character summary", "self_value_lo...
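
The score constraints in the character-generation prompt above can be checked mechanically once a character's JSON is parsed. The sketch below is illustrative only: the function name and error strings are hypothetical, and it reads "one dimension must be extreme" as exactly one, which is an assumption about the prompt's intent. The values 0.95, 0.10, and the 0.30-0.70 range come from the prompt text.

```python
import math

TRAITS = ("O", "C", "E", "A", "N")


def validate_big_five(scores):
    """Return a list of constraint violations for a big_five dict (empty if none)."""
    if set(scores) != set(TRAITS):
        return ["expected exactly the keys O, C, E, A, N"]

    errors = []
    # A trait counts as extreme when it sits at 0.95 (high) or 0.10 (low).
    extreme = [
        t for t in TRAITS
        if math.isclose(scores[t], 0.95) or math.isclose(scores[t], 0.10)
    ]
    if len(extreme) != 1:  # assumption: exactly one extreme dimension
        errors.append(f"expected one extreme trait, found {len(extreme)}")

    # Every remaining trait must stay in the moderate band.
    for t in TRAITS:
        if t not in extreme and not 0.30 <= scores[t] <= 0.70:
            errors.append(f"{t}={scores[t]} is outside the moderate range 0.30-0.70")
    return errors
```

For example, `validate_big_five({"O": 0.95, "C": 0.5, "E": 0.4, "A": 0.6, "N": 0.3})` returns an empty list, while raising `C` to 0.8 produces a single range violation.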

[73] Environmental sensory: visual/auditory/tactile/olfactory details

[74] Dialogue reconstruction: at least 1-2 segments of real dialogue (in quotes)

[75] Inner monologue: first-person immediate thoughts and emotions

[76] Somatic response: heartbeat/sweating/muscle tension and other bodily sensations

[77] Aftermath: immediate impact after the event (limited to 24 hours-1 week post-event)

## Key Constraint: First-Person & Anonymity Principle
- **Must use first-person "I" throughout; prohibit third-person pronouns for protagonist**
- **Prohibit character's own name in memory**; always use "I" instead
- When others address protagonist in dialogue, avoid chara...

[78] **id**: SCN_{STAGE}_{N} in upper case, where STAGE is one of SCHOOL_AGE / ADOLESCENCE / EARLY_ADULT_TRANSITION / ENTERING_ADULT_WORLD / AGE_30_TRANSITION / SETTLING_DOWN / MIDLIFE_TRANSITION / ENTERING_MIDLIFE

[79] **stage**: lowercase form of the above

[80] **diamonds_dimension**: the dominant DIAMONDS dimension (Duty, Adversity, Positivity, ...) as given in the sketch
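
The id and stage fields described above follow a fixed pattern, so a small parser can validate a scenario id and derive the stage from it. This is a sketch under assumptions: the function name, the returned dict shape, and the choice to treat N as a run of decimal digits are hypothetical. The stage names are those listed in the field description.

```python
import re

STAGES = (
    "SCHOOL_AGE",
    "ADOLESCENCE",
    "EARLY_ADULT_TRANSITION",
    "ENTERING_ADULT_WORLD",
    "AGE_30_TRANSITION",
    "SETTLING_DOWN",
    "MIDLIFE_TRANSITION",
    "ENTERING_MIDLIFE",
)

# SCN_{STAGE}_{N}, upper case; N is assumed to be one or more digits.
_ID_RE = re.compile(r"SCN_(" + "|".join(STAGES) + r")_(\d+)")


def parse_scenario_id(scenario_id):
    """Split a scenario id into its stage (lowercased) and numeric index."""
    match = _ID_RE.fullmatch(scenario_id)
    if match is None:
        raise ValueError(f"not a valid scenario id: {scenario_id!r}")
    return {
        "id": scenario_id,
        "stage": match.group(1).lower(),
        "index": int(match.group(2)),
    }
```

Calling `parse_scenario_id("SCN_ENTERING_MIDLIFE_5")` yields the stage `entering_midlife` and index 5; a lowercase or unknown-stage id raises `ValueError`.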