arxiv: 2509.10078 · v3 · submitted 2025-09-12 · 💻 cs.CL · cs.AI

Human Psychometric Questionnaires Mischaracterize LLM Psychology: Evidence from Generation Behavior

Pith reviewed 2026-05-18 17:39 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords LLM psychometrics · questionnaire validity · generation behavior · personality profiling · value assessment · AI psychology · model bias · human-AI interaction

The pith

LLM responses to human psychometric questionnaires substantially differ from their generation probabilities on real-world user queries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares two ways of profiling the psychology of eight open-source LLMs. One way uses standard human questionnaires that ask the model to rate statements about values or personality traits on a Likert scale. The other way measures how likely the model is to generate value-laden or personality-laden answers when responding to actual user queries drawn from real interactions. The two resulting profiles turn out to be substantially different. A sympathetic reader would care because many existing studies describe LLMs as having stable personalities or values based solely on questionnaire answers; if those answers do not match how the models actually generate text with users, the descriptions are unreliable. The work therefore challenges earlier claims that LLMs possess consistent psychological dispositions.

Core claim

For eight open-source LLMs, self-reported Likert scores from established questionnaires such as PVQ-40, PVQ-21, BFI-44, and BFI-10 differ substantially from generation probability scores of value- or personality-laden responses to real-world user queries. This difference supplies evidence that LLMs' answers to questionnaires reflect desired behavior rather than stable psychological constructs. The results also indicate that established questionnaires risk exaggerating demographic biases and that generation-based profiling offers a more reliable route to LLM psychometrics.

What carries the argument

Direct comparison of questionnaire-based self-reports against generation probability scores for laden responses to user queries; the comparison reveals the mismatch between the two profiling methods.
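
As a minimal sketch of what such a comparison could look like (this is not the paper's code; the construct names and every number below are invented for illustration), two profiles over the same constructs can be compared with a Pearson correlation and a cosine similarity. The helper names `pearson` and `cosine` are assumptions of this sketch.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length score vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def cosine(x, y):
    """Cosine similarity between two equal-length score vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Hypothetical profiles for one model (illustrative values only):
# mean Likert ratings from a questionnaire, and generation-based scores.
questionnaire = {"benevolence": 5.6, "universalism": 5.4, "security": 4.1, "power": 1.8}
generation = {"benevolence": 0.31, "universalism": 0.22, "security": 0.35, "power": 0.12}

# Align both profiles on the same construct order before comparing.
constructs = sorted(questionnaire)
q = [questionnaire[c] for c in constructs]
g = [generation[c] for c in constructs]

print(f"Pearson r = {pearson(q, g):.3f}, cosine = {cosine(q, g):.3f}")
```

One design note: cosine similarity on raw, all-positive scores tends to sit close to 1 regardless of profile shape, so a mean-centered measure such as Pearson's r is usually the more informative of the two for this kind of comparison.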

If this is right

  • Established questionnaires risk exaggerating the demographic biases of LLMs.
  • Psychological profiles derived from questionnaires should be interpreted with caution.
  • Generation-based profiling is a more reliable approach to LLM psychometrics.
  • Prior claims of consistent psychological dispositions in LLMs are challenged by the observed mismatch.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future LLM evaluation could shift emphasis from direct self-report surveys to observing behavior in simulated user conversations.
  • The divergence may show that training processes encourage LLMs to perform well on questionnaires without producing matching internal consistency across different contexts.
  • Alignment researchers could apply similar generation-based checks to test whether safety training affects questionnaire answers more than actual output distributions.

Load-bearing premise

Generation probability scores of value- or personality-laden responses to real-world user queries accurately capture the LLMs' psychological characteristics expressed during interactions with users.
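
To make the premise concrete, here is a hedged sketch of how a generation probability score might be assembled from per-token log-probabilities. The page does not spell out the paper's exact scoring formula, so the length normalization and the per-construct averaging below are assumptions, and `response_score` and `construct_profile` are hypothetical names. The token log-probabilities would come from whatever model is being profiled; here they are supplied by hand.

```python
import math
from collections import defaultdict

def response_score(token_logprobs):
    """Length-normalized probability of a response given its query.

    Assumption of this sketch: the score is the geometric mean of the
    per-token probabilities, i.e. exp(mean of token log-probabilities).
    """
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def construct_profile(items):
    """Average response scores per construct.

    `items` is an iterable of (construct_name, token_logprobs) pairs, one
    pair per (user query, value- or personality-laden response) item.
    """
    buckets = defaultdict(list)
    for construct, token_logprobs in items:
        buckets[construct].append(response_score(token_logprobs))
    return {c: sum(scores) / len(scores) for c, scores in buckets.items()}

# Hand-written log-probabilities standing in for real model outputs.
items = [
    ("benevolence", [math.log(0.5), math.log(0.5)]),
    ("benevolence", [math.log(0.25)]),
    ("power", [math.log(0.1), math.log(0.1), math.log(0.1)]),
]
print(construct_profile(items))
```

Whether a score built this way tracks anything like a psychological disposition is exactly what the premise asserts; the sketch only shows the bookkeeping, not a validation of it.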

What would settle it

A new test that finds strong positive correlation between questionnaire scores and generation probabilities across a broad set of LLMs and query collections would undermine the claim that the two profiles are substantially different.

Figures

Figures reproduced from arXiv: 2509.10078 by Dongmin Choi, Jongwook Han, Woojung Song, Yohan Jo, Yoonah Park.

Figure 1: Prompt template for Value Portrait items
Figure 2: Bar chart showing the average confidence interval width across 10 models at the value dimension level.
Figure 3: Hero-villain score differences across constructs. Points to the right of zero indicate heroes score higher, …
Figure 4: Average absolute differences in value scores across demographic contrasts (male vs. female, religious …
Figure 5: Value Portrait prompt version 1.
    Value Portrait consists of human-LLM conversations (ShareGPT and LMSYS (Zheng et al., 2024)) and human-human advisory contexts (Reddit (Lourie et al., 2021) and Dear Abby archives). Items from human-LLM conversations and human-human advisory contexts require different prompts. Items sourced from human-LLM conversations lack titles and consist of direct user queries to LLM…
Figure 6: Value Portrait prompt version 2.
    Prompt Template for Value Portrait items from human-LLM conversations: Now I will briefly describe a message and response. Please read them and tell me how similar this response is to your own thoughts. Please answer, even if you are not completely sure of your response. Message: {text} Response: {content} IMPORTANT: Your response must contain ONLY ONE of these exact phras…
Figure 7: Prompt template for Value Portrait items
Figure 9: Prompt template for BFI items.
    Prompt Template for Established Questionnaires Item-Construct Recognition: You are an expert in psychology. Question: “{question_text}” Available {construct_type}: {all_items} Which ONE of these {construct_type} does this question primarily measure? Choose the single best match. Respond with only the exact name from the list above
Figure 10: Prompt template for item-construct recog…
Figure 11: Prompt template for item-construct recog…
Figure 12: System prompts used to assign demographic …
Figure 13: Bar chart showing the average confidence interval width across 10 models at the trait dimension level.
Original abstract

Psychological profiling of large language models (LLMs) using psychometric questionnaires designed for humans has become widespread. However, it remains unclear whether the resulting profiles mirror the models' psychological characteristics expressed during their real-world interactions with users. To examine the risk of human questionnaires mischaracterizing LLM psychology, we compare two types of profiles for eight open-source LLMs: self-reported Likert scores from established questionnaires (PVQ-40, PVQ-21, BFI-44, BFI-10) and generation probability scores of value- or personality-laden responses to real-world user queries. The two profiles turn out to be substantially different and provide evidence that LLMs' responses to established questionnaires reflect desired behavior rather than stable psychological constructs, which challenges the consistent psychological dispositions of LLMs claimed in prior work. Established questionnaires also risk exaggerating the demographic biases of LLMs. Our results suggest caution when interpreting psychological profiles derived from established questionnaires and point to generation-based profiling as a more reliable approach to LLM psychometrics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper compares psychological profiles for eight open-source LLMs obtained from Likert-scale responses to established human questionnaires (PVQ-40, PVQ-21, BFI-44, BFI-10) against profiles derived from generation probability scores of value- or personality-laden responses to real-world user queries. It reports that the two profile types are substantially different, concluding that questionnaire responses reflect desired behavior rather than stable psychological constructs, thereby mischaracterizing LLM psychology, challenging prior claims of consistent dispositions, and risking exaggeration of demographic biases; generation-based profiling is positioned as more reliable.

Significance. If the central empirical discrepancy holds after addressing methodological gaps, the work would be significant for LLM evaluation and AI psychology research. It supplies a direct test of whether human-designed instruments capture interaction-relevant traits and offers an alternative generation-based approach. The use of multiple questionnaires and models provides breadth, though the result's impact hinges on validating the generation scores as a faithful proxy for stable dispositions.

major comments (2)
  1. [§3] §3 (Methods, generation probability scoring): The central claim interprets the discrepancy as evidence that questionnaires elicit 'desired behavior' rather than stable traits, but this requires that generation probability scores validly measure psychological characteristics expressed in user interactions. No independent validation is reported (e.g., correlation with human ratings of outputs, test-retest stability across query sets, or predictive validity for downstream behaviors), leaving the conclusion equally consistent with the generation method being unreliable or artifact-prone.
  2. [Results] Results section (profile comparison): The abstract and main text state that the two profiles 'turn out to be substantially different,' yet no quantitative metrics (correlation, cosine similarity, or statistical tests with sample sizes and controls) or tables reporting these values are described. Without such evidence, the magnitude and reliability of the difference cannot be assessed and the claim that questionnaires mischaracterize LLM psychology remains under-supported.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'substantially different' would be clearer if accompanied by a brief indication of the metric or effect size used to quantify the difference.
  2. [Figures] Figure captions: Ensure all figures comparing profiles include axis labels, legend details, and any error information for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive review of our manuscript. We appreciate the opportunity to address the major comments and have revised the paper to strengthen the presentation of our methods and results.

Point-by-point responses
  1. Referee: [§3] §3 (Methods, generation probability scoring): The central claim interprets the discrepancy as evidence that questionnaires elicit 'desired behavior' rather than stable traits, but this requires that generation probability scores validly measure psychological characteristics expressed in user interactions. No independent validation is reported (e.g., correlation with human ratings of outputs, test-retest stability across query sets, or predictive validity for downstream behaviors), leaving the conclusion equally consistent with the generation method being unreliable or artifact-prone.

    Authors: We thank the referee for this important methodological point. The generation probability scores are obtained by computing the model's log-probabilities for producing value- or personality-aligned continuations to real-world user queries drawn from public interaction logs; this directly samples from the distribution the model uses during actual user interactions. While the original submission did not include external validation experiments (such as human ratings of generated outputs or test-retest checks), we maintain that the method provides a more ecologically valid proxy for expressed behavior than forced Likert responses. In the revision we have added a dedicated paragraph in the Methods section justifying the approach, explicitly stating its assumptions, and acknowledging the absence of independent validation as a limitation that future work should address. revision: partial

  2. Referee: [Results] Results section (profile comparison): The abstract and main text state that the two profiles 'turn out to be substantially different,' yet no quantitative metrics (correlation, cosine similarity, or statistical tests with sample sizes and controls) or tables reporting these values are described. Without such evidence, the magnitude and reliability of the difference cannot be assessed and the claim that questionnaires mischaracterize LLM psychology remains under-supported.

    Authors: We agree that quantitative metrics are required to support the claim of substantial differences. The original manuscript presented the profile comparisons primarily through visualizations and qualitative description. In the revised Results section we now include a table reporting Pearson correlations, cosine similarities, and results of paired statistical tests (with sample sizes, degrees of freedom, and multiple-comparison corrections) between the questionnaire-derived and generation-based profiles for each of the eight models and four questionnaires. These metrics confirm low correlations and statistically significant differences, providing the requested quantitative grounding for the conclusion. revision: yes
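
The rebuttal mentions paired statistical tests with multiple-comparison corrections but does not say which ones. As one plausible illustration (an assumption of this sketch, not the authors' procedure), an exact sign-flip permutation test on per-model paired differences can be combined with Holm's step-down correction across constructs. The function names and all numbers are hypothetical.

```python
from itertools import product

def sign_flip_pvalue(diffs):
    """Exact two-sided sign-flip permutation p-value for paired differences.

    Under the null hypothesis that the two profiles do not differ
    systematically, each paired difference is equally likely to be positive
    or negative, so all 2**n sign assignments are enumerated.
    """
    observed = abs(sum(diffs))
    n_extreme = 0
    n_total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        n_total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # small tolerance for float ties
            n_extreme += 1
    return n_extreme / n_total

def holm(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, (m - rank) * pvalues[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted

# Invented paired differences (questionnaire score minus generation score,
# after putting both on a common scale) for eight models, per construct.
diffs_by_construct = {
    "benevolence": [0.9, 1.1, 0.7, 1.3, 0.8, 1.0, 1.2, 0.6],
    "power": [-0.2, 0.1, -0.1, 0.3, -0.3, 0.2, 0.0, -0.1],
}
raw = [sign_flip_pvalue(d) for d in diffs_by_construct.values()]
for name, p_adj in zip(diffs_by_construct, holm(raw)):
    print(f"{name}: Holm-adjusted p = {p_adj:.4f}")
```

With eight models the enumeration covers only 256 sign assignments, so the exact test is cheap; for larger samples a random subset of sign flips would be the usual substitute.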

Circularity Check

0 steps flagged

Empirical comparison of questionnaire and generation profiles shows no reduction to fitted inputs or self-referential definitions

full rationale

The paper performs a direct empirical comparison between Likert-scale responses from established human psychometric questionnaires (PVQ-40, PVQ-21, BFI-44, BFI-10) and generation probability scores derived from value- or personality-laden responses to real-world user queries across eight open-source LLMs. The central observation—that the resulting profiles differ substantially—is presented as an empirical finding rather than a mathematical derivation. No equations, fitted parameters, or predictions are involved that reduce outputs to inputs by construction. Citations to prior work on LLM psychological dispositions are used to contextualize the challenge but do not serve as load-bearing uniqueness theorems or self-citation chains that justify the core claim. The analysis remains self-contained through data collection and profile comparison without circular redefinition or smuggling of ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that generation probabilities on real queries constitute a valid ground-truth measure of LLM psychology against which questionnaire responses can be judged.

axioms (1)
  • domain assumption: Generation probability scores of value- or personality-laden responses accurately reflect LLMs' psychological characteristics in real interactions
    This premise is required to interpret questionnaire responses as mischaracterizing rather than simply differing from generation behavior.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Distributional Open-Ended Evaluation of LLM Cultural Value Alignment Based on Value Codebook

    cs.CL 2026-03 unverdicted novelty 6.0

    DOVE constructs a value codebook via rate-distortion variational optimization from 10K documents and measures LLM-human cultural alignment through unbalanced optimal transport, showing 31.56% correlation with downstre...

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    Bojana Bodroža, Bojana M. Dinić, and Ljubiša Bojić. 2024. https://doi.org/10.1098/rsos.240180 Personality testing of large language models: limited temporal stability, but highlighted prosociality. Royal Society Open Science, 11(10)

  2. [2]

    Graham Caron and Shashank Srivastava. 2023. https://doi.org/10.18653/v1/2023.findings-emnlp.156 Manipulating the perceived personality traits of language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 2370--2386, Singapore. Association for Computational Linguistics

  3. [3]

    Lucio La Cava and Andrea Tagarelli. 2025. https://doi.org/10.1609/aaai.v39i2.32125 Open models, closed minds? on agents capabilities in mimicking human personalities through open large language models. In Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence and Thirty-Seventh Conference on Innovative Applications of Artificial Intel...

  4. [4]

    Jiaxi Cui, Liuzhenghao Lv, Jing Wen, Rongsheng Wang, Jing Tang, YongHong Tian, and Li Yuan. 2024. http://arxiv.org/abs/2312.12999 Machine mindset: An mbti exploration of large language models

  5. [5]

    Ronald Fischer, Markus Luczak-Roesch, and Johannes A Karl. 2023. http://arxiv.org/abs/2304.03612 What does chatgpt return about human values? exploring value bias in chatgpt using a descriptive value theory

  6. [6]

    Lewis R. Goldberg. 1999. A broad-bandwidth, public domain, personality inventory measuring the lower-level facets of several five-factor models. In Ivan Mervielde, Ian Deary, Filip De Fruyt, and Fritz Ostendorf, editors, Personality Psychology in Europe, volume 7, pages 7--28. Tilburg University Press, Tilburg, The Netherlands

  7. [7]

    Akshat Gupta, Xiaoyang Song, and Gopala Anumanchipalli. 2024. Self-assessment tests are unreliable measures of llm personality. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 301--314

  8. [8]

    D Hadar-Shoval, K Asraf, Y Mizrachi, Y Haber, and Z Elyoseph. 2024. https://doi.org/10.2196/55988 Assessing the alignment of large language models with human values for mental health integration: Cross-sectional study using schwartz's theory of basic values. JMIR Mental Health, 11:e55988

  9. [9]

    Christian Haerpfer, Ronald Inglehart, Alejandro Moreno, Christian Welzel, Kseniya Kizilova, Juan Diez-Medrano, Marta Lagos, Pippa Norris, Eduard Ponarin, and Björn Puranen, editors. 2014. World Values Survey: Round Six -- Country-Pooled Datafile Version. JD Systems Institute, Madrid, Spain

  10. [10]

    Jongwook Han, Dongmin Choi, Woojung Song, Eun-Ju Lee, and Yohan Jo. 2025. https://doi.org/10.18653/v1/2025.acl-long.838 Value portrait: Assessing language models' values through psychometrically and ecologically valid items. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 17119--17...

  11. [11]

    Jennifer Hu and Roger Levy. 2023. https://doi.org/10.48550/arXiv.2305.13264 Prompt-based methods may underestimate large language models' linguistic generalizations

  12. [12]

    Jen-tse Huang, Wenxuan Wang, Eric John Li, Man Ho Lam, Shujie Ren, Youliang Yuan, Wenxiang Jiao, Zhaopeng Tu, and Michael R Lyu. 2024. On the humanity of conversational ai: Evaluating the psychological portrayal of llms. In The Twelfth International Conference on Learning Representations

  13. [13]

    Guangyuan Jiang, Manjie Xu, Song-Chun Zhu, Wenjuan Han, Chi Zhang, and Yixin Zhu. 2023. Evaluating and inducing personality in pre-trained language models. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, Red Hook, NY, USA. Curran Associates Inc

  14. [14]

    Oliver P John, Eileen M Donahue, and Robert L Kentle. 1991. Big five inventory. Journal of Personality and Social Psychology

  15. [15]

    Dongjin Kang, Sunghwan Kim, Taeyoon Kwon, Seungjun Moon, Hyunsouk Cho, Youngjae Yu, Dongha Lee, and Jinyoung Yeo. 2024. https://doi.org/10.18653/v1/2024.acl-long.813 Can large language models be good emotional supporter? mitigating preference bias on emotional support conversation. In Proceedings of the 62nd Annual Meeting of the Association for Computat...

  16. [16]

    Grgur Kovač, Rémy Portelas, Masataka Sawayama, Peter Ford Dominey, and Pierre-Yves Oudeyer. 2024. Stick to your role! stability of personal values expressed in large language models. PLOS ONE, 19(8):e0309114

  17. [17]

    Bruce W. Lee, Yeongheon Lee, and Hyunsoo Cho. 2025. http://arxiv.org/abs/2408.09049 When prompting fails to sway: Inertia in moral and value judgments of large language models

  18. [18]

    Yuan Li, Yue Huang, Hongyi Wang, Xiangliang Zhang, James Zou, and Lichao Sun. 2024. Quantifying ai psychology: A psychometrics benchmark for large language models. arXiv preprint arXiv:2406.17675

  19. [19]

    Nicholas Lourie, Ronan Le Bras, and Yejin Choi. 2021. Scruples: A corpus of community ethical judgments on 32,000 real-life anecdotes. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 13470--13479

  20. [20]

    Marilù Miotto, Nicola Rossberg, and Bennett Kleinberg. 2022. https://doi.org/10.18653/v1/2022.nlpcss-1.24 Who is GPT-3? an exploration of personality, values and demographics. In Proceedings of the Fifth Workshop on Natural Language Processing and Computational Social Science (NLP+CSS), pages 218--227, Abu Dhabi, UAE. Association for Computational Li...

  21. [21]

    Isabel Briggs Myers and Mary H. McCaulley. 1985. Manual: A Guide to the Development and Use of the Myers-Briggs Type Indicator, 2nd edition. Consulting Psychologists Press, Palo Alto, CA

  22. [22]

    Beatrice Rammstedt and Oliver P John. 2007. Measuring personality in one minute or less: A 10-item short version of the big five inventory in english and german. Journal of research in Personality, 41(1):203--212

  23. [23]

    Abhinav Sukumar Rao, Aditi Khandelwal, Kumar Tanmay, Utkarsh Agarwal, and Monojit Choudhury. 2023a. https://doi.org/10.18653/v1/2023.findings-emnlp.892 Ethical reasoning over moral alignment: A case and framework for in-context ethical policies in LLMs. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 13370--13388, Singa...

  24. [24]

    Haocong Rao, Cyril Leung, and Chunyan Miao. 2023b. https://doi.org/10.18653/v1/2023.findings-emnlp.84 Can ChatGPT assess human personalities? a general evaluation framework. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 1184--1194, Singapore. Association for Computational Linguistics

  25. [25]

    Yuanyi Ren, Haoran Ye, Hanjun Fang, Xin Zhang, and Guojie Song. 2024. https://doi.org/10.18653/v1/2024.acl-long.111 ValueBench: Towards comprehensively evaluating value orientations and understanding of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2015-...

  26. [26]

    Naama Rozen, Liat Bezalel, Gal Elidan, Amir Globerson, and Ella Daniel. 2025. https://proceedings.iclr.cc/paper_files/paper/2025/file/68fb4539dabb0e34ea42845776f42953-Paper-Conference.pdf Do llms have consistent values? In International Conference on Representation Learning, volume 2025, pages 42441--42467

  27. [27]

    Shalom H. Schwartz. 2003. A proposal for measuring value orientations across nations. In Questionnaire Development Report of the European Social Survey, pages 259--290. European Social Survey, City University London

  28. [28]

    Shalom H Schwartz. 2012. https://doi.org/10.9707/2307-0919.1116 An overview of the schwartz theory of basic values. Online readings in Psychology and Culture, 2(1):11

  29. [29]

    Shalom H Schwartz, Gila Melech, Arielle Lehmann, Steven Burgess, Mari Harris, and Vicki Owens. 2001. https://doi.org/10.1177/0022022101032005001 Extending the cross-cultural validity of the theory of basic human values with a different method of measurement. Journal of Cross-Cultural Psychology, 32(5):519--542

  30. [30]

    Greg Serapio-García, Mustafa Safdari, Clément Crepy, Luning Sun, Stephen Fitz, Peter Romero, Marwa Abdulhai, Aleksandra Faust, and Maja Matarić. 2025. http://arxiv.org/abs/2307.00184 Personality traits in large language models

  31. [31]

    Yunfan Shao, Linyang Li, Junqi Dai, and Xipeng Qiu. 2023. https://aclanthology.org/2023.emnlp-main.814/ Character-LLM: A trainable agent for role-playing. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13153--13187, Singapore. Association for Computational Linguistics

  32. [32]

    Hua Shen, Nicholas Clark, and Tanushree Mitra. 2025. http://arxiv.org/abs/2501.15463 Mind the value-action gap: Do llms act in alignment with their values?

  33. [33]

    Bangzhao Shu, Lechen Zhang, Minje Choi, Lavinia Dunagan, Lajanugen Logeswaran, Moontae Lee, Dallas Card, and David Jurgens. 2024. You don’t need a personality test to know these models are unreliable: Assessing the reliability of large language models on psychometric instruments. In Proceedings of the 2024 Conference of the North American Chapter of the A...

  34. [34]

    Taylor Sorensen, Liwei Jiang, Jena D Hwang, Sydney Levine, Valentina Pyatkin, Peter West, Nouha Dziri, Ximing Lu, Kavel Rao, Chandra Bhagavatula, et al. 2024. Value kaleidoscope: Engaging ai with pluralistic human values, rights, and duties. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19937--19947

  35. [35]

    Christopher J. Soto and Oliver P. John. 2017. https://doi.org/10.1037/pspp0000096 The next big five inventory (BFI-2): Developing and assessing a hierarchical model with 15 facets to enhance bandwidth, fidelity, and predictive power. Journal of Personality and Social Psychology, 113(1):117--143. Epub 2016 Apr 7

  36. [36]

    Jing Yao, Xiaoyuan Yi, Yifan Gong, Xiting Wang, and Xing Xie. 2024a. https://doi.org/10.18653/v1/2024.naacl-long.486 Value FULCRA: Mapping large language models to the multidimensional spectrum of basic human value. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Tech...

  37. [37]

    Jing Yao, Xiaoyuan Yi, and Xing Xie. 2024b. Clave: An adaptive framework for evaluating values of llm generated responses. arXiv preprint arXiv:2407.10725

  38. [38]

    Jing Yao, Xiaoyuan Yi, and Xing Xie. 2024c. https://proceedings.neurips.cc/paper_files/paper/2024/file/6c1d2496c04d1ef648d58684b699643f-Paper-Datasets_and_Benchmarks_Track.pdf Clave: An adaptive framework for evaluating values of llm generated responses. In Advances in Neural Information Processing Systems, volume 37, pages 58868--58900. Curran Associates, Inc

  39. [39]

    Haoran Ye, Jing Jin, Yuhang Xie, Xin Zhang, and Guojie Song. 2025a. http://arxiv.org/abs/2505.08245 Large language model psychometrics: A systematic review of evaluation, validation, and enhancement

  40. [40]

    Haoran Ye, Yuhang Xie, Yuanyi Ren, Hanjun Fang, Xin Zhang, and Guojie Song. 2025b. https://doi.org/10.1609/aaai.v39i25.34839 Measuring human and ai values based on generative psychometrics with large language models. In Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence and Thirty-Seventh Conference on Innovative Applications of...

  41. [41]

    Haoran Ye, TianZe Zhang, Yuhang Xie, Liyuan Zhang, Yuanyi Ren, Xin Zhang, and Guojie Song. 2025c. https://doi.org/10.18653/v1/2025.acl-long.585 Generative psycho-lexical approach for constructing value systems in large language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pa...

  42. [42]

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric Xing, Joseph E. Gonzalez, Ion Stoica, and Hao Zhang. 2024. https://openreview.net/forum?id=BOfDKxfwt0 LMSYS-Chat-1M: A large-scale real-world LLM conversation dataset. In The Twelfth International Conference on Learning Representations