HEART-Bench: Do LLM Agents Exhibit Human-like Psychology?
Pith reviewed 2026-06-29 07:16 UTC · model grok-4.3
The pith
HEART-Bench tests whether LLM agents make decisions consistent with assigned Big Five personality traits and 1,000 episodic memories across 64 scenarios.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The benchmark constructs 11 diverse human characters grounded in orthogonal Big Five personality traits, with each profile integrated with 1,000 structured autobiographical-style episodic memories distributed across theory-grounded developmental life stages. It subjects agents to 64 decision-making scenarios guided by the DIAMONDS taxonomy and evaluates whether agents consolidate their traits and memories to produce behavioral decisions consistent with their psychological profiles, yielding 673 multiple-choice questions after human validation.
What carries the argument
The DIAMONDS taxonomy of eight situational dimensions used to curate the 64 scenarios that probe whether decisions remain consistent with pre-assigned personality profiles and episodic memories.
If this is right
- Agents can be evaluated for the ability to consolidate personality traits with autobiographical memories when facing varied situations.
- The method supplies a scalable way to study value-consistent behavioral decision-making in LLM agents.
- Human-validated multiple-choice questions allow repeatable measurement of personality consistency and emotional dimensions.
- The benchmark treats emotional dimensions as equal in importance to task-oriented abilities such as planning and reasoning.
Where Pith is reading between the lines
- The same construction could be reused to track whether larger models maintain profile consistency better than smaller ones.
- Conflicts deliberately introduced between memories and traits could test whether agents resolve value tensions in human-like ways.
- Results might inform whether current alignment techniques produce stable psychological profiles or only surface-level mimicry.
Load-bearing premise
That consistency between an agent's choices in the scenarios and its pre-assigned personality profile plus memories counts as evidence of coherent human-like psychology.
What would settle it
Observing that agents assigned different personality profiles produce statistically similar choice patterns across the 64 scenarios would show the benchmark fails to detect profile-specific psychology.
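A minimal sketch of that comparison, assuming each agent's answers have already been tallied into per-profile counts over a shared set of answer options. The function name, the `CHAR_01`-style keys, and the table layout are illustrative assumptions, not part of the paper. It computes a chi-square homogeneity statistic: a value near zero means the profiles chose alike, which is the outcome that would count against the benchmark.

```python
from typing import Dict, List, Tuple


def chi_square_homogeneity(counts: Dict[str, List[int]]) -> Tuple[float, int]:
    """Return (chi-square statistic, degrees of freedom) for a profile-by-option table.

    counts maps a hypothetical profile key to how often each answer option was chosen.
    """
    rows = list(counts.values())
    n_cols = len(rows[0])
    if any(len(row) != n_cols for row in rows):
        raise ValueError("every profile needs one count per answer option")

    row_totals = [sum(row) for row in rows]
    col_totals = [sum(row[j] for row in rows) for j in range(n_cols)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(rows):
        for j, observed in enumerate(row):
            # Expected count if all profiles shared one choice distribution.
            expected = row_totals[i] * col_totals[j] / grand_total
            if expected > 0:
                statistic += (observed - expected) ** 2 / expected

    dof = (len(rows) - 1) * (n_cols - 1)
    return statistic, dof


# Two profiles with identical choice counts: no detectable profile effect.
same = {"CHAR_01": [10, 20, 30, 40], "CHAR_02": [10, 20, 30, 40]}
# Two profiles with opposite preferences: a large statistic.
different = {"CHAR_01": [40, 10], "CHAR_02": [10, 40]}
```

Turning the statistic into a p-value requires the chi-square distribution with the returned degrees of freedom, which this standard-library sketch leaves out.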
Original abstract
While LLM agents have demonstrated remarkable task-oriented abilities such as planning, reasoning, and action, few works have treated them as complete human personalities where emotional dimensions hold equal importance. In this paper, we introduce a novel benchmark to systematically assess whether LLM agents can simulate coherent, human-like psychology. Specifically, our benchmark constructs 11 diverse human characters grounded in orthogonal Big Five personality traits, with each profile deeply integrated with 1,000 structured autobiographical-style episodic memories distributed across theory-grounded developmental life stages. To rigorously evaluate the psychological manifestations of LLMs, we designed a curated suite of 64 decision-making scenarios, guided by the DIAMONDS taxonomy, a psychological framework that characterizes situations along eight dimensions: Duty, Intellect, Adversity, Mating, pOsitivity, Negativity, Deception, and Sociality. By subjecting agents to varying scenarios, the benchmark evaluates whether they can consolidate their innate personality traits and autobiographical memories to make behavioral decisions that are consistent with their specific psychological profiles. After systematic human validation and filtering, we obtained a benchmark consisting of 673 multiple-choice questions (MCQs). We believe this benchmark provides a principled and scalable testbed for studying human-like emotions, personality consistency, and value-consistent behavioural decision-making in LLM-based agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces HEART-Bench, a benchmark consisting of 11 characters grounded in orthogonal Big Five personality traits integrated with 1,000 episodic memories each, evaluated via 64 DIAMONDS scenarios to produce 673 MCQs after human validation. The central claim is that this provides a principled testbed for whether LLM agents can consolidate personality traits and memories into behaviorally consistent decisions that reflect human-like psychology.
Significance. If the benchmark can be shown to isolate psychological simulation from instruction-following, it would offer a scalable resource for studying personality consistency and value-based decisions in LLM agents. The current design, however, does not yet demonstrate this isolation, limiting the significance of the contribution to the field.
major comments (2)
- [Abstract, evaluation design paragraph] The claim that consistency between agent choices on the 64 DIAMONDS scenarios and the supplied Big-Five profiles plus 1,000 episodic memories constitutes evidence of coherent, human-like psychology lacks any described controls (e.g., profile ablation, conflicting memories, or comparison against human response distributions) that would distinguish it from simple prompt parroting.
- [Abstract] The manuscript describes benchmark construction and human validation but supplies no empirical results on LLM performance, no error analysis, and no evidence that the MCQs actually elicit or measure the claimed psychological consistency.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on the benchmark design and scope. We respond to each major comment below and indicate planned revisions to address the concerns while preserving the paper's focus on benchmark construction and validation.
Point-by-point responses
- Referee: [Abstract, evaluation design paragraph] The claim that consistency between agent choices on the 64 DIAMONDS scenarios and the supplied Big-Five profiles plus 1,000 episodic memories constitutes evidence of coherent, human-like psychology lacks any described controls (e.g., profile ablation, conflicting memories, or comparison against human response distributions) that would distinguish it from simple prompt parroting.
  Authors: We agree that the current text overstates the evidential value of consistency without controls. The benchmark design uses orthogonal traits and 1,000 memories per character to increase the difficulty of pure prompt parroting, but this does not substitute for explicit ablations. In revision we will (1) soften the abstract and evaluation paragraphs to describe the benchmark as enabling tests of psychological consistency rather than directly constituting evidence, (2) add a limitations subsection that explicitly discusses the lack of profile ablation, conflicting-memory, and human-distribution controls, and (3) outline planned follow-up experiments using those controls.
  Revision: yes
- Referee: [Abstract] The manuscript describes benchmark construction and human validation but supplies no empirical results on LLM performance, no error analysis, and no evidence that the MCQs actually elicit or measure the claimed psychological consistency.
  Authors: The manuscript's primary contribution is the construction and human validation of HEART-Bench; no LLM evaluations or error analyses are reported because the work stops at releasing the testbed. We acknowledge that including baseline LLM results would help demonstrate the benchmark's utility. In the revised version we will add a short experimental section reporting performance of 2-3 representative LLMs on the 673 MCQs together with a basic error analysis, while clearly labeling these results as preliminary.
  Revision: yes
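A minimal sketch of how such baseline numbers and a profile-ablation control could be tabulated, assuming each model answer is logged with a condition label. The `Response` record and the condition names `full_profile` and `no_profile` are hypothetical and are not part of the benchmark release.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Response(NamedTuple):
    condition: str    # hypothetical label, e.g. "full_profile" or "no_profile"
    question_id: str  # identifier of one multiple-choice question
    chosen: str       # option the model selected
    gold: str         # option judged consistent with the character


def accuracy_by_condition(responses: Iterable[Response]) -> Dict[str, float]:
    """Return the fraction of profile-consistent answers for each condition."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for response in responses:
        total[response.condition] += 1
        correct[response.condition] += int(response.chosen == response.gold)
    return {condition: correct[condition] / total[condition] for condition in total}


def ablation_gap(accuracy: Dict[str, float],
                 full: str = "full_profile",
                 ablated: str = "no_profile") -> float:
    """Accuracy lost when the profile is withheld; near zero suggests the profile is unused."""
    return accuracy[full] - accuracy[ablated]
```

A gap close to zero between the two conditions would suggest the questions can be answered without the profile, which is the concern the referee raises.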
Circularity Check
No significant circularity; externally constructed benchmark with no fitted predictions or self-referential derivations
full rationale
The paper introduces HEART-Bench by constructing 11 character profiles from orthogonal Big Five traits, integrating 1,000 author-generated episodic memories, and curating 64 DIAMONDS-guided scenarios into 673 MCQs. Evaluation measures behavioral consistency with the supplied profiles and memories. This construction is explicit and external to any prior fitted parameters or equations in the paper. No derivation chain exists that reduces a claimed prediction to its inputs by construction (e.g., no parameter fitting followed by a renamed prediction, no load-bearing self-citation of a uniqueness theorem, no ansatz smuggled in via citation). The central claim concerns the benchmark's utility as a testbed rather than a derived result equivalent to its inputs. Self-citations, if present, are not load-bearing for the benchmark design itself. This matches the default expectation of no circularity for a benchmark paper.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Big Five personality traits are orthogonal and suitable for constructing distinct human characters
- domain assumption: DIAMONDS taxonomy accurately characterizes situations for eliciting psychological responses
invented entities (1)
- HEART-Bench: no independent evidence
Reference graph
Works this paper leans on
- [1] Anthropic. Introducing Claude Opus 4.7. https://www.anthropic.com/news/claude-opus-4-7, 2026. Accessed: 2026-05-06
- [2] OpenAI. Introducing GPT-5.5. https://openai.com/index/introducing-gpt-5-5/, 2026. Accessed: 2026-05-06
- [3] Google DeepMind. Gemini 3.1 Pro: A smarter model for your most complex tasks. https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/, 2026. Accessed: 2026-05-06
- [4] Jiangjie Chen, Xintao Wang, Rui Xu, Siyu Yuan, Yikai Zhang, Wei Shi, Jian Xie, Shuang Li, Ruihan Yang, Tinghui Zhu, Aili Chen, Nianqi Li, Lida Chen, Caiyu Hu, Siye Wu, Scott Ren, Ziquan Fu, and Yanghua Xiao. From persona to personalization: A survey on role-playing language agents. arXiv preprint arXiv:2404.18231, 2024
- [5] Yu-Min Tseng, Yu-Chao Huang, Teng-Yun Hsiao, Wei-Lin Chen, Chao-Wei Huang, Yu Meng, and Yun-Nung Chen. Two tales of persona in llms: A survey of role-playing and personalization. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 16612–16631, 2024
- [6] Bethanie Maples, Merve Cerit, Aditya Vishwanath, and Roy Pea. Loneliness and suicide mitigation for students using gpt3-enabled chatbots. npj mental health research, 3(1):4, 2024
- [7] Jason Phang, Michael Lampe, Lama Ahmad, Sandhini Agarwal, Cathy Mengying Fang, Auren R Liu, Valdemar Danry, Eunhae Lee, Samantha WT Chan, Pat Pataranutaporn, et al. Investigating affective use and emotional well-being on chatgpt. arXiv preprint arXiv:2504.03888, 2025
- [8] Jiale Wei, Xiang Ying, Tao Gao, Fangyi Bao, Felix Tao, and Jingbo Shang. Ai-native memory 2.0: Second me. arXiv preprint arXiv:2503.08102, 2025
- [9] Donggun Lee, Suyoun Lee, Hyunseung Lim, and Hwajung Hong. Creating text-based ai clones of myself: Exploring perceptions, development strategies, and challenges. International Journal of Human-Computer Studies, 208:103692, 2026
- [10] George E Marcus, W Russell Neuman, and Michael MacKuen. Affective intelligence and political judgment. University of Chicago Press, 2000
- [11] Peter Salovey and John D Mayer. Emotional intelligence. Imagination, cognition and personality, 9(3):185–211, 1990
- [12] Jefferson A Singer. Seeing one’s self: Locating narrative memory in a framework of personality. Journal of Personality, 63(3):429–457, 1995
- [13] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems, 33:9459–9474, 2020
- [14] Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready ai agents with scalable long-term memory. arXiv preprint arXiv:2504.19413, 2025
- [15] Chenkai Sun, Ke Yang, Revanth Gangi Reddy, Yi Fung, Hou Pong Chan, Kevin Small, ChengXiang Zhai, and Heng Ji. Persona-db: Efficient large language model personalization for response prediction with collaborative data refinement. In Proceedings of the 31st International Conference on Computational Linguistics, pages 281–296, 2025
- [16] Quan Tu, Shilong Fan, Zihang Tian, Tianhao Shen, Shuo Shang, Xin Gao, and Rui Yan. Charactereval: A chinese benchmark for role-playing conversational agent evaluation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11836–11850, 2024
- [17] Tao Ge, Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, and Dong Yu. Scaling synthetic data creation with 1,000,000,000 personas. arXiv preprint arXiv:2406.20094, 2024
- [18] Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of llm agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13851–13870, 2024
- [19] Sen Hu, Zhiyu Zhang, Yuxiang Wei, Xueran Han, Zhenheng Tang, Huacan Wang, and Ronghao Chen. Clonemem: Benchmarking long-term memory for ai clones. arXiv preprint arXiv:2601.07023, 2026
- [20] Zheng Zhao, Clara Vania, Subhradeep Kayal, Naila Khan, Shay B Cohen, and Emine Yilmaz. Personalens: A benchmark for personalization evaluation in conversational ai assistants. In Findings of the Association for Computational Linguistics: ACL 2025, pages 18023–18055, 2025
- [21] Bowen Jiang, Zhuoqun Hao, Young-Min Cho, Bryan Li, Yuan Yuan, Sihao Chen, Lyle Ungar, Camillo J Taylor, and Dan Roth. Know me, respond to me: Benchmarking llms for dynamic user profiling and personalized responses at scale. arXiv preprint arXiv:2504.14225, 2025
- [22] Robert R McCrae and Oliver P John. An introduction to the five-factor model and its applications. Journal of personality, 60(2):175–215, 1992
- [23] Robert R McCrae and Paul T Costa. Personality trait structure as a human universal. American Psychologist, 52(5):509–516, 1997
- [24] Brent W Roberts, Nathan R Kuncel, Rebecca Shiner, Avshalom Caspi, and Lewis R Goldberg. The power of personality: The comparative validity of personality traits, socioeconomic status, and cognitive ability for predicting important life outcomes. Perspectives on Psychological Science, 2(4):313–345, 2007
- [25] William Fleeson. Toward a structure- and process-integrated view of personality: Traits as density distributions of states. Journal of Personality and Social Psychology, 80(6):1011–1027, 2001
- [26] Martin A Conway and Christopher W Pleydell-Pearce. The construction of autobiographical memories in the self-memory system. Psychological review, 107(2):261, 2000
- [27] Jeffrey E. Young, Janet S. Klosko, and Marjorie E. Weishaar. Schema Therapy: A Practitioner’s Guide. Guilford Press, 2006
- [28] Yunfan Shao, Linyang Li, Junqi Dai, and Xipeng Qiu. Character-LLM: A trainable agent for role-playing. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13153–13187, 2023
- [29] Hang Jiang, Xiajie Zhang, Xubo Cao, Cynthia Breazeal, Deb Roy, and Jad Kabbara. PersonaLLM: Investigating the ability of large language models to express personality traits. In Findings of the association for computational linguistics: NAACL 2024, pages 3605–3627, 2024
- [30] Weizhi Zhang, Xinyang Zhang, Chenwei Zhang, Liangwei Yang, Jingbo Shang, Zhepei Wei, Henry Peng Zou, Zijie Huang, Zhengyang Wang, Yifan Gao, Xiaoman Pan, Lian Xiong, Jingguo Liu, Philip S. Yu, and Xian Li. PersonaAgent: When large language model agents meet personalization at test time. arXiv preprint arXiv:2506.06254, 2025
- [31] Ao Tian, Yunfeng Lu, Xinxin Fan, Changhao Wang, Lanzhi Zhou, Yeyao Zhang, and Yanfang Liu. Rgmem: Renormalization group-based memory evolution for language agent user profile. arXiv preprint arXiv:2510.16392, 2025
- [32] Charles Packer, Vivian Fang, Shishir G Patil, Kevin Lin, Sarah Wooders, and Joseph E Gonzalez. Memgpt: towards llms as operating systems. 2023
- [33] Bernal J Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. Hipporag: Neurobiologically inspired long-term memory for large language models. Advances in neural information processing systems, 37:59532–59569, 2024
- [34] Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for LLM agents. arXiv preprint arXiv:2502.12110, 2025
- [35] Gordon H Bower. Mood and memory. American psychologist, 36(2):129, 1981
- [36] Greg Serapio-García, Mustafa Safdari, Clément Crepy, Luning Sun, Stephen Fitz, Peter Romero, Marwa Abdulhai, Aleksandra Faust, and Maja Matarić. Personality traits in large language models. arXiv preprint arXiv:2307.00184, 2023
- [37] Dan P McAdams. The psychology of life stories. Review of general psychology, 5(2):100–122, 2001
- [38] Monisha Pasupathi, Emma Mansour, and Jed R Brubaker. Developing a life story: Constructing relations between self and experience in autobiographical narratives. Human development, 50(2-3):85–110, 2007
- [39] Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. Longmemeval: Benchmarking chat assistants on long-term interactive memory. arXiv preprint arXiv:2410.10813, 2024
- [40] Haonan Bian, Zhiyuan Yao, Sen Hu, Zishan Xu, Shaolei Zhang, Yifu Guo, Ziliang Yang, Xueran Han, Huacan Wang, and Ronghao Chen. Realmem: Benchmarking LLMs in real-world memory-driven interaction. arXiv preprint arXiv:2601.06966, 2026
- [41] Tingyu Wu, Zhisheng Chen, Ziyan Weng, Shuhe Wang, Chenglong Li, Shuo Zhang, Sen Hu, Silin Wu, Qizhen Lan, Huacan Wang, et al. Knowme-bench: Benchmarking person understanding for lifelong digital companions. arXiv preprint arXiv:2601.04745, 2026
- [42] John F Rauthmann, David Gallardo-Pujol, Emmanuel M Guillaume, Emma Todd, Christopher S Nave, Robert A Sherman, Matthias Ziegler, and David C Funder. The situational eight diamonds: A taxonomy of major dimensions of situation characteristics. Journal of Personality and Social Psychology, 107(4):677–718, 2014
- [43] David C. Rubin. The basic-systems model of autobiographical memory. Perspectives on Psychological Science, 1(1):3–11, 2006
- [44] Erik H. Erikson. Childhood and Society. W. W. Norton & Company, New York, 2nd edition, 1963
- [45] Daniel J. Levinson, Charlotte N. Darrow, Edward B. Klein, Maria H. Levinson, and Braxton McKee. The Seasons of a Man’s Life. Knopf, New York, 1978
- [46] Shalom H Schwartz. Universals in the content and structure of values: Theoretical advances and empirical tests in 20 countries. In Advances in experimental social psychology, volume 25, pages 1–65. Elsevier, 1992
- [47] Daniel J. Levinson. A conception of adult development. American Psychologist, 41(1):3–13, 1986
- [48] Tony Sun, Andrew Gaut, Shirlyn Tang, Yuxin Huang, Mai ElSherief, Jieyu Zhao, Diba Mirza, Elizabeth Belding, Kai-Wei Chang, and William Yang Wang. Mitigating gender bias in natural language processing: Literature review. In Proceedings of the 57th annual meeting of the association for computational linguistics, pages 1630–1640, 2019
- [49] Anthropic. Introducing Claude Opus 4.6. https://www.anthropic.com/news/claude-opus-4-6, 2026. Accessed: 2026-05-06
- [50] DeepSeek-AI. Deepseek-v3.2: Pushing the frontier of open large language models, 2025
- [51] DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026
- [52] Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026
- [53] OpenAI. Introducing GPT-5.4. https://openai.com/index/introducing-gpt-5-4/, 2026. Accessed: 2026-05-06
- [54] Anthropic. Introducing Claude Haiku 4.5. https://www.anthropic.com/news/claude-haiku-4-5, 2025. Accessed: 2026-05-06
- [55] Anthropic. Introducing Claude Sonnet 4.6. https://www.anthropic.com/news/claude-sonnet-4-6, 2026. Accessed: 2026-05-06
- [56] Google DeepMind. Gemini 3 Flash model card. https://blog.google/products-and-platforms/products/gemini/gemini-3-flash/, 2025. Accessed: 2026-05-06
- [57] Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176, 2025
- [58] Character constraints: full Big Five vector, occupation, self-value logic, and all eight core behavioural patterns from Appendix B.1
- [59] Developmental context: target age window, life-stage label, and the core developmental task from Table 8 (Appendix B.6)
- [60] Rubin coverage: per-dimension guidance requiring organic integration of all five dimensions without explicit section labels. Unlike semantic memory, which stores abstract, decontextualized facts (e.g., “failed a job interview”), each content_full narrative must encode information that collectively reconstructs the subjective texture of the experience and is irreducible...
- [61] Perspective Drift (Omniscient Narrator). Violating the first-person narrative constraint. Detected when the narrator uses post-hoc knowledge or future-tense markers (e.g., “years later,” “I didn’t know then”). This violates the strict “in-the-moment” constraint.
  2. Perspective Drift (Adult Evaluation). Specifically flags child-age memories (≤12) that use adu...
- [62] Field-Length Inversion. context exceeding 60 characters or content_summary exceeding 100 characters, indicating narrative leakage into metadata fields
- [63] Semantic Duplication. Redundant life episodes within the same developmental age window. This is detected using a cosine similarity threshold of >0.85 over TF-IDF vector representations of the narratives
- [64] Length Violation. content_full below 2,000 characters (insufficient Rubin coverage) or above 4,500 characters (excessive verbosity)
- [65] Schema-Vocabulary Inflation. Use of high-register psychoanalytic terms (e.g., “abandonment schema”, “dysregulation”) in low-intensity episodes (<0.4), indicating over-psychologisation.
  Repair procedure. Flagged memories enter a targeted repair loop: only the defective field(s) are regenerated using an issue-specific instruction appended to the original generati...
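The length and duplication checks quoted in these fragments can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the issue labels, the word tokenizer, and the smoothed IDF formula are choices made here, and the source does not say how its TF-IDF vectors are built. Only the field names (`context`, `content_summary`, `content_full`) and the numeric limits (60, 100, 2,000, 4,500, and 0.85) come from the text.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def length_issues(memory: Dict[str, str]) -> List[str]:
    """Flag the length-based checks using the character limits quoted in the text."""
    issues = []
    if len(memory.get("context", "")) > 60:
        issues.append("field_length_inversion:context")
    if len(memory.get("content_summary", "")) > 100:
        issues.append("field_length_inversion:content_summary")
    full_len = len(memory.get("content_full", ""))
    if full_len < 2000 or full_len > 4500:
        issues.append("length_violation:content_full")
    return issues


def tfidf_vectors(texts: List[str]) -> List[Dict[str, float]]:
    """Build sparse TF-IDF vectors (assumed tokenizer: lowercase word characters)."""
    docs = [Counter(re.findall(r"\w+", text.lower())) for text in texts]
    doc_freq = Counter(term for doc in docs for term in doc)
    n_docs = len(docs)
    # Smoothed IDF is an assumption; the source does not specify a weighting.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def duplicate_pairs(texts: List[str], threshold: float = 0.85) -> List[Tuple[int, int]]:
    """Return index pairs whose TF-IDF cosine similarity exceeds the threshold."""
    vectors = tfidf_vectors(texts)
    return [
        (i, j)
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
        if cosine(vectors[i], vectors[j]) > threshold
    ]
```

Because the source applies the duplication check within a developmental age window, a caller would pass only the narratives from one window at a time. The word-based tokenizer also assumes whitespace-delimited text and would need replacing for languages written without spaces.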
- [66] Logical and psychological screening. Two doctoral-level psychology researchers jointly check each scenario for logical consistency and psychological plausibility. Any flagged scenario is revised by consensus before entering the second pass
- [67] SCN_ENTERING_MIDLIFE_5
  Independent rating and conflict resolution. Both reviewers independently rate each scenario on the eight DIAMONDS dimensions (1–7 Likert) and identify the scenario’s dominant Schwartz value-conflict pair. Disagreements above 1 point trigger joint re-assessment. We do not require a uniform DIAMONDS rating distribution; the goal is that each dimension be e...
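The disagreement rule in this fragment can be sketched in a few lines, assuming each reviewer's ratings arrive as a mapping from dimension name to an integer score. The function name and input layout are hypothetical; the eight dimension names, the 1–7 scale, and the 1-point threshold come from the text.

```python
from typing import Dict, List

DIAMONDS = (
    "Duty", "Intellect", "Adversity", "Mating",
    "Positivity", "Negativity", "Deception", "Sociality",
)


def dimensions_to_reassess(rater_a: Dict[str, int],
                           rater_b: Dict[str, int],
                           max_gap: int = 1) -> List[str]:
    """Return the DIAMONDS dimensions where two raters differ by more than max_gap."""
    flagged = []
    for dim in DIAMONDS:
        a, b = rater_a[dim], rater_b[dim]
        for value in (a, b):
            if not 1 <= value <= 7:
                raise ValueError(f"{dim} rating {value} is outside the 1-7 Likert scale")
        if abs(a - b) > max_gap:
            flagged.append(dim)
    return flagged


# Example: the raters agree within one point everywhere except Deception.
first = {dim: 4 for dim in DIAMONDS}
second = dict(first, Duty=5, Deception=6)
```

With these inputs, only `Deception` is returned, since a one-point difference on `Duty` stays within the stated tolerance.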
- [68] Big Five scores: Provide exact scores for O, C, E, A, N (0.0-1.0 scale)
- [69] One dimension must be extreme (0.95 for high or 0.10 for low), others moderate (0.30-0.70)
- [70] Self-value logic: One sentence describing the character's core cognitive operating principle
- [71] Core behavioral patterns: Exactly 8 specific, observable patterns that manifest the dominant trait
- [72] Occupation: Must be ecologically consistent with the dominant trait
  ## Output Format (JSON)
  { "char_key": "CHAR_XX", "name": "Chinese name", "occupation": "specific job title", "big_five": {"O": 0.X, "C": 0.X, "E": 0.X, "A": 0.X, "N": 0.X}, "big_five_str": "O=0.X, C=0.X, E=0.X, A=0.X, N=0.X", "description": "2-3 sentence character summary", "self_value_lo...
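A minimal sketch of a validator for the score constraints quoted in these prompt fragments. It reads "one dimension must be extreme" as exactly one, which is an interpretation rather than something the source states, and the function name and issue messages are illustrative.

```python
import math
from typing import Dict, List

TRAITS = ("O", "C", "E", "A", "N")
EXTREME_VALUES = (0.95, 0.10)   # high and low extremes quoted in the prompt
MODERATE_RANGE = (0.30, 0.70)   # range required for the remaining traits


def profile_issues(big_five: Dict[str, float]) -> List[str]:
    """Return a list of constraint violations; an empty list means the profile passes."""
    missing = [trait for trait in TRAITS if trait not in big_five]
    if missing:
        return [f"missing traits: {', '.join(missing)}"]

    issues = []
    extreme = [
        trait for trait in TRAITS
        if any(math.isclose(big_five[trait], value, abs_tol=1e-9) for value in EXTREME_VALUES)
    ]
    # Assumption: exactly one extreme trait is allowed per character.
    if len(extreme) != 1:
        issues.append(f"expected exactly one extreme trait, found {len(extreme)}")

    low, high = MODERATE_RANGE
    for trait in TRAITS:
        if trait in extreme:
            continue
        if not low <= big_five[trait] <= high:
            issues.append(f"{trait}={big_five[trait]} is outside the moderate range")
    return issues
```

For example, `{"O": 0.95, "C": 0.5, "E": 0.4, "A": 0.6, "N": 0.3}` passes, while a profile with no extreme trait or with a non-dominant trait at 0.8 is flagged.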
- [73] Environmental sensory: visual/auditory/tactile/olfactory details
- [74] Dialogue reconstruction: at least 1-2 segments of real dialogue (in quotes)
- [75] Inner monologue: first-person immediate thoughts and emotions
- [76] Somatic response: heartbeat/sweating/muscle tension and other bodily sensations
- [77] Aftermath: immediate impact after the event (limited to 24 hours-1 week post-event)
  ## Key Constraint: First-Person & Anonymity Principle
  - **Must use first-person "I" throughout; prohibit third-person pronouns for protagonist**
  - **Prohibit character's own name in memory**; always use "I" instead
  - When others address protagonist in dialogue, avoid chara...
- [78] **id**: SCN_{STAGE}_{N} in upper case, where STAGE is one of SCHOOL_AGE / ADOLESCENCE / EARLY_ADULT_TRANSITION / ENTERING_ADULT_WORLD / AGE_30_TRANSITION / SETTLING_DOWN / MIDLIFE_TRANSITION / ENTERING_MIDLIFE
- [79] **stage**: lowercase form of the above
- [80] **diamonds_dimension**: the dominant DIAMONDS dimension (Duty, Adversity, Positivity, ...) as given in the sketch
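The `id` and `stage` rules above can be checked with a small parser. This is a sketch under the assumption that `N` is a non-negative integer written in digits, which the fragment implies but does not state; the function name is hypothetical, while the eight stage labels are copied from the text.

```python
import re
from typing import Optional, Tuple

STAGES = (
    "SCHOOL_AGE", "ADOLESCENCE", "EARLY_ADULT_TRANSITION", "ENTERING_ADULT_WORLD",
    "AGE_30_TRANSITION", "SETTLING_DOWN", "MIDLIFE_TRANSITION", "ENTERING_MIDLIFE",
)

# SCN_{STAGE}_{N}: upper-case prefix, one of the listed stages, then an index.
_SCENARIO_ID = re.compile(r"SCN_(?P<stage>" + "|".join(STAGES) + r")_(?P<n>\d+)")


def parse_scenario_id(scenario_id: str) -> Optional[Tuple[str, int]]:
    """Return (stage in lowercase, index) for a valid id, or None if it is malformed."""
    match = _SCENARIO_ID.fullmatch(scenario_id)
    if match is None:
        return None
    return match.group("stage").lower(), int(match.group("n"))
```

For instance, `parse_scenario_id("SCN_ENTERING_MIDLIFE_5")` yields the lowercase stage `entering_midlife` and index `5`, while a lowercase id or an unlisted stage returns `None`.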