Simulating Eating Disorder Patients with LLMs: Evaluating Psychological Persona Stability in Multi-Turn Conversations

Jana Gonnermann-Müller; Jan Mendling; Jennifer Haase; Nicolas Leins; Sebastian Pokutta; See Heng Yim

arxiv: 2606.26109 · v1 · pith:GG4RJI3F · submitted 2026-05-12 · 💻 cs.CY · cs.MA

Pith reviewed 2026-06-30 22:15 UTC · model grok-4.3

classification 💻 cs.CY cs.MA

keywords eating disorders · LLM simulation · persona stability · EDE-Q · stereotyping bias · clinical personas · multi-turn dialogue · severity overshoot

The pith

LLM eating disorder simulations overshoot true severity and miss moderate cases entirely.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can maintain consistent eating disorder personas across multiple conversations using published case vignettes that carry known EDE-Q scores. It applies a dual-assessment method of self-report plus independent observer ratings to outputs from six models in two experiments on between- and within-conversation stability. Results show almost no response variability, yet every model rates overall severity 0.7 to 1.8 points higher than the ground truth on the 0-6 scale. The pattern arises because models vary behavioral items such as dietary restraint across vignettes but push cognitive-affective items like body dissatisfaction and weight preoccupation to maximum levels regardless of case severity. Additional context does not correct the mismatch and instead increases the overshoot, leaving LLMs able to depict only severe presentations.

Core claim

LLMs assigned eating disorder personas based on published case vignettes produce outputs with negligible variability across conversations. All tested models overshoot the ground-truth EDE-Q severity scores by 12-30% of the scale range. The models accurately differentiate cases along behavioral dimensions but uniformly maximize cognitive-affective dimensions at ceiling levels. This pattern persists even when additional conversational context is provided, which instead increases the degree of overshoot. The result is that LLMs can simulate severe eating pathology but cannot produce moderate clinical presentations, indicating a missing middle in their clinical representations.
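The percentage figures follow from the width of the scale. A minimal sketch of that arithmetic, assuming the 0-6 EDE-Q range quoted above; the function name is illustrative and does not come from the paper.

```python
SCALE_MIN, SCALE_MAX = 0.0, 6.0  # EDE-Q response range as stated in the text


def overshoot_as_percent_of_range(points: float) -> float:
    """Express an overshoot in scale points as a percentage of the scale range."""
    return 100.0 * points / (SCALE_MAX - SCALE_MIN)


low = overshoot_as_percent_of_range(0.7)
high = overshoot_as_percent_of_range(1.8)
print(f"{low:.1f}% to {high:.1f}%")  # prints 11.7% to 30.0%
```

Rounded to whole percentages, these values match the 12-30% range reported in the claim.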

What carries the argument

Selective stereotyping mechanism, in which behavioral items such as dietary restraint vary by case while cognitive-affective items such as body dissatisfaction are maximized at ceiling independent of vignette severity.
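One way to make this pattern checkable is to label each subscale by how it behaves across cases. The sketch below is an illustration only: the per-case means are invented, and the two cutoffs (`CEILING`, `MIN_SPREAD`) are assumptions rather than values from the paper.

```python
# Hypothetical per-case subscale means on the 0-6 scale; NOT data from the paper.
cases = {
    "case_A": {"restraint": 1.5, "shape_concern": 5.8, "weight_concern": 5.7},
    "case_B": {"restraint": 3.0, "shape_concern": 5.9, "weight_concern": 5.8},
    "case_C": {"restraint": 5.2, "shape_concern": 6.0, "weight_concern": 5.9},
}

CEILING = 5.5     # assumed cutoff: every case at or above this counts as "ceiling"
MIN_SPREAD = 1.0  # assumed cutoff: max minus min across cases counts as "differentiated"


def classify_subscales(case_scores):
    """Label each subscale as 'ceiling', 'differentiated', or 'flat' across cases."""
    subscales = next(iter(case_scores.values())).keys()
    labels = {}
    for name in subscales:
        values = [scores[name] for scores in case_scores.values()]
        if min(values) >= CEILING:
            labels[name] = "ceiling"
        elif max(values) - min(values) >= MIN_SPREAD:
            labels[name] = "differentiated"
        else:
            labels[name] = "flat"
    return labels


labels = classify_subscales(cases)
print(labels)
```

With these made-up numbers, restraint is labeled differentiated while shape and weight concern are labeled ceiling, which is the shape of result the mechanism describes.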

If this is right

LLMs maintain high persona stability across multi-turn conversations without meaningful drift.
Additional conversational context does not improve accuracy and instead compounds the severity overshoot.
LLMs succeed at portraying severe eating pathology but cannot represent moderate clinical presentations.
The missing middle limits use cases that require graduated symptom severity for training or research.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Ceiling effects on affective symptoms may appear in LLM simulations of other clinical conditions where emotional content dominates behavioral content.
Fine-tuning or prompting strategies that penalize uniform maximization of cognitive items could reduce the observed bias.
Deployments in clinical training would need explicit calibration checks for moderate-severity cases to avoid systematic distortion.
Repeating the protocol with real patient self-report data rather than vignettes could test whether the overshoot is vignette-specific.

Load-bearing premise

The published case vignettes supply reliable ground-truth EDE-Q scores that can be directly compared with LLM-generated responses.

What would settle it

If LLM EDE-Q scores for the same vignettes aligned with the published ground truths within typical measurement error of roughly 0.3 points instead of overshooting by 0.7-1.8 points, the claim of systematic inaccuracy would be falsified.
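The test described above can be written as a simple tolerance check. This is a sketch under stated assumptions: the 0.3-point tolerance is the figure quoted in the text, overshoot is taken as LLM mean minus ground truth, and the score pairs are hypothetical.

```python
TOLERANCE = 0.3  # "typical measurement error" figure quoted in the text


def overshoot(llm_mean: float, ground_truth: float) -> float:
    """Signed difference: positive means the LLM rates higher than ground truth."""
    return llm_mean - ground_truth


def all_within_tolerance(pairs, tol: float = TOLERANCE) -> bool:
    """True if every (llm_mean, ground_truth) pair agrees within the tolerance."""
    return all(abs(overshoot(llm, truth)) <= tol for llm, truth in pairs)


# Hypothetical (llm_mean, ground_truth) pairs; NOT data from the paper.
aligned = [(3.2, 3.0), (4.4, 4.5)]
overshooting = [(4.2, 3.0), (5.6, 4.5)]

print(all_within_tolerance(aligned))       # True
print(all_within_tolerance(overshooting))  # False
```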

Figures

Figures reproduced from arXiv: 2606.26109 by Jana Gonnermann-Müller, Jan Mendling, Jennifer Haase, Nicolas Leins, Sebastian Pokutta, See Heng Yim.

**Figure 1.** Experimental design. Five ED vignettes × three prompt richness levels × six LLMs yield 90 conditions. Experiment I tests between-conversation stability (N = 50 independent runs per condition); Experiment II tests within-conversation stability (9-exchange neutral dialogues, assessed at exchanges 3, 6, 9, with N = 20). Both use dual assessment: the persona completes self-report EDE-Q and three LLM raters fro…

**Figure 2.** Persona stability across and within conversations.

**Figure 3.** LLM mean EDE-Q global score versus ground truth (Full prompt), shown separately for…

**Figure 4.** EDE-Q subscale means by case (Full prompt, all six models). SC and WC are near…

**Figure 5.** Overshoot (LLM mean − ground truth) for self-report versus observer ratings, averaged across all six models and prompt richness levels. Filled bars: self-report; hatched bars: observer. Both sources show the same severity-dependent overshoot pattern, demonstrating that the overshoot originates in persona generation, not in observer assessment.

**Figure 6.** Mean EDE-Q item scores across five cases (Full prompt, all six models). Rows are…
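The factorial design in the Figure 1 caption can be enumerated directly. The sketch below only reproduces the counts; the vignette, richness-level, and model names are placeholders, since the captions name only the "Full" prompt level.

```python
from itertools import product

# Placeholder labels; the actual vignette, prompt-level, and model names are not given here.
vignettes = [f"vignette_{i}" for i in range(1, 6)]      # five ED vignettes
richness_levels = ["level_1", "level_2", "level_3"]     # three prompt richness levels
models = [f"model_{i}" for i in range(1, 7)]            # six LLMs

conditions = list(product(vignettes, richness_levels, models))

RUNS_PER_CONDITION_EXP1 = 50        # Experiment I: independent runs per condition
ASSESSMENT_EXCHANGES_EXP2 = (3, 6, 9)  # Experiment II: exchanges at which EDE-Q is assessed

total_runs_exp1 = len(conditions) * RUNS_PER_CONDITION_EXP1
print(len(conditions), total_runs_exp1)  # prints 90 4500
```

The caption does not state whether N = 20 in Experiment II is per condition, so the sketch leaves that total out.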

Original abstract

Large language model (LLM)-based simulations of clinical patients are increasingly used for research and training, yet their validity requires persona stability: coherent maintenance of an assigned psychological profile across and within conversations. We evaluate this prerequisite using eating disorder personas grounded in five published case vignettes, a dual-assessment framework (self-report + independent observer ratings), and validated psychometric instruments (EDE-Q) with known ground-truth scores. Across six LLMs and two experiments (between-conversation stability (Exp. I) and within-conversation stability (Exp. II)), we find that LLMs are paradoxically too stable and too inaccurate: variability is negligible, yet all models systematically overshoot ground-truth severity by 12-30% of the scale range (0.7-1.8 points on a 0-6 scale). The mechanism is selective stereotyping: models differentiate cases on behavioural items (dietary restraint) but maximise cognitive-affective items (body dissatisfaction, weight preoccupation) at ceiling regardless of case severity. Additional conversational context does not improve accuracy; it compounds the overshoot. LLMs can portray severe eating pathology but lack a representation of moderate clinical presentations, a "missing middle".

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LLMs hold eating disorder personas too rigidly and overshoot severity on cognitive items, but the vignette-to-EDE-Q mapping is the part that needs checking first.

The central observation is that six different LLMs keep assigned eating disorder profiles stable across turns yet consistently rate higher than the ground-truth scores from the five case vignettes, mainly by maxing out body dissatisfaction and weight preoccupation while showing more variation on restraint items. This produces a missing middle where moderate cases disappear.

The work sets up two clear experiments on between-conversation and within-conversation stability, uses both self-report and observer ratings on the EDE-Q, and tests multiple models against published vignettes. That combination gives a concrete picture of where the stereotyping happens and shows that extra context does not fix the overshoot.

The soft spot is the ground-truth comparison itself. Case vignettes are narrative summaries, so assigning exact 0-6 subscale scores requires interpretive steps that are not independently verified in the abstract. EDE-Q was built for human self-report of lived experience; applying it to LLM-generated dialogue or third-party ratings of that dialogue adds another layer of equivalence that is not automatic. If those mappings shift even modestly, the reported 0.7-1.8 point overshoot and the selective stereotyping claim lose precision.

The paper is aimed at people who build or evaluate LLM simulations for clinical training and research. Readers working on AI safety in mental health domains will find the stability results and the missing-middle pattern useful to cite or test against.

It has enough structure and a timely empirical question to deserve peer review rather than a desk reject. Referees will likely press on the measurement assumptions, but the design is straightforward enough that those issues can be addressed in revision.

Referee Report

3 major / 2 minor

Summary. The paper evaluates LLM-based simulation of eating disorder patients using five published case vignettes with assigned EDE-Q ground-truth scores. Across six LLMs and two experiments (between-conversation stability in Exp. I; within-conversation in Exp. II), it employs a dual-assessment framework (LLM self-report plus independent observer ratings on the EDE-Q) and reports negligible response variability yet systematic overshoot of ground-truth severity by 0.7-1.8 points (12-30% of scale range). The proposed mechanism is selective stereotyping, with differentiation on behavioral items (dietary restraint) but ceiling maximization on cognitive-affective items (body dissatisfaction, weight preoccupation) regardless of case severity; additional context worsens accuracy, implying LLMs can represent severe but not moderate presentations (the 'missing middle').

Significance. If the ground-truth comparisons and instrument transfer are valid, the findings would be significant for the expanding use of LLMs in clinical training and research, as they identify a systematic bias that restricts representation of the full severity spectrum. The structured design with multiple models, dual assessment, and validated instruments (EDE-Q) provides a replicable template for testing persona fidelity; the explicit reporting of overshoot magnitudes and subscale patterns strengthens falsifiability.

major comments (3)

Methods (Case Vignettes subsection): The overshoot claim (0.7-1.8 points) and 'missing middle' conclusion rest on the five vignettes supplying precise, validated EDE-Q global and subscale scores that are directly comparable to LLM outputs. The manuscript must detail the exact mapping procedure from narrative descriptions to 0-6 item scores, including any interpretive assumptions or independent clinical verification, because vignette-to-instrument translation is not automatic and directly determines whether the reported effect sizes are interpretable.
§3 (Dual-Assessment Framework) and Results (Selective Stereotyping): The claim that models differentiate behavioral items but maximize cognitive-affective items at ceiling requires explicit reporting of per-subscale means, standard deviations, and statistical contrasts (e.g., restraint vs. shape concern differences) for each model and experiment. Without these, the mechanism cannot be distinguished from uniform severity inflation or instrument-specific response styles.
Exp. II (Within-conversation stability): The finding that additional conversational context compounds overshoot is load-bearing for the stability conclusion, yet the manuscript provides insufficient detail on prompt construction, context length, and controls for order effects. This leaves open whether the effect is due to persona drift or to how context is injected.

minor comments (2)

Abstract: The range '12-30% of the scale range' should be accompanied by model-specific or subscale-specific percentages to allow precise evaluation of the effect size.
Figures (e.g., severity plots): Ensure variability measures (error bars or individual trajectory lines) are legible to visually support the 'negligible variability' claim across turns and models.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which highlight opportunities to strengthen the methodological transparency of our work on LLM-based eating disorder persona simulations. We address each major comment below and will incorporate revisions to enhance replicability.

Referee: Methods (Case Vignettes subsection): The overshoot claim (0.7-1.8 points) and 'missing middle' conclusion rest on the five vignettes supplying precise, validated EDE-Q global and subscale scores that are directly comparable to LLM outputs. The manuscript must detail the exact mapping procedure from narrative descriptions to 0-6 item scores, including any interpretive assumptions or independent clinical verification, because vignette-to-instrument translation is not automatic and directly determines whether the reported effect sizes are interpretable.

Authors: We agree that explicit documentation of the vignette-to-EDE-Q mapping is essential for interpretability of the effect sizes. The five case vignettes are drawn from published clinical case studies in which EDE-Q scores were originally assigned or derivable from the symptom descriptions. In the revised manuscript, we will add a dedicated subsection in Methods that details the item-by-item mapping procedure from narrative elements to 0-6 scores, including the interpretive assumptions applied (e.g., inferring severity levels from descriptions of dietary behaviors or body image concerns). We will also clarify that no new independent clinical verification was conducted by our team beyond reliance on the published sources, making the process fully transparent for readers.

revision: yes

Referee: §3 (Dual-Assessment Framework) and Results (Selective Stereotyping): The claim that models differentiate behavioral items but maximize cognitive-affective items at ceiling requires explicit reporting of per-subscale means, standard deviations, and statistical contrasts (e.g., restraint vs. shape concern differences) for each model and experiment. Without these, the mechanism cannot be distinguished from uniform severity inflation or instrument-specific response styles.

Authors: We agree that per-subscale reporting is required to rigorously support the selective stereotyping mechanism over alternatives such as uniform inflation. The revised manuscript will include tables reporting mean EDE-Q subscale scores (Restraint, Eating Concern, Shape Concern, Weight Concern) and standard deviations for each of the six models in both experiments. We will additionally report statistical contrasts (e.g., paired t-tests or mixed ANOVA) between the behavioral Restraint subscale and the cognitive-affective Shape/Weight Concern subscales to demonstrate the differentiation pattern and rule out uniform response styles.

revision: yes

Referee: Exp. II (Within-conversation stability): The finding that additional conversational context compounds overshoot is load-bearing for the stability conclusion, yet the manuscript provides insufficient detail on prompt construction, context length, and controls for order effects. This leaves open whether the effect is due to persona drift or to how context is injected.

Authors: We acknowledge that greater detail on Experiment II is needed to isolate the source of the compounding overshoot. In the revised Methods, we will expand the description of prompt construction by providing the full templates used for context injection, report exact context lengths (both in number of turns and approximate token counts), and detail the controls for order effects, including any randomization or counterbalancing of context presentation across trials. These additions will allow readers to evaluate whether the observed effect stems from persona representation rather than prompt engineering artifacts.

revision: yes
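The paired contrast mentioned in the second response (Restraint versus Shape Concern) could be computed along these lines. This is a sketch, not the authors' analysis: the scores are invented, the helper `paired_t` is hypothetical, and only the t statistic and degrees of freedom are computed, using the standard library.

```python
import math
import statistics


def paired_t(a, b):
    """Paired-samples t statistic and degrees of freedom for equal-length lists."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD, n - 1 denominator
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical per-case subscale means for one model; NOT data from the paper.
restraint = [1.5, 2.4, 3.0, 4.1, 5.2]
shape_concern = [5.8, 5.9, 5.9, 6.0, 6.0]

t_stat, df = paired_t(restraint, shape_concern)
print(f"t({df}) = {t_stat:.2f}")
```

A negative t here means Restraint sits below Shape Concern on average. With only five vignettes the test has little power, so an analysis would probably also report the per-case differences themselves.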

Circularity Check

0 steps flagged

No circularity: empirical reporting study with independent observations

The paper is an empirical evaluation of LLM persona stability using published case vignettes as ground-truth, dual-assessment ratings on EDE-Q, and direct comparison of model outputs to those scores. No derivations, equations, fitted parameters renamed as predictions, or self-citation load-bearing steps appear in the provided text. Central claims rest on experimental observations (overshoot magnitudes, selective stereotyping) rather than any reduction to inputs by construction. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Empirical evaluation study; no mathematical derivations or fitted parameters appear in the central claim. Relies on standard domain assumptions about LLM prompting and psychometric validity.

axioms (2)

Domain assumption: LLMs can be prompted to maintain assigned psychological personas based on published case vignettes. Foundational to creating the simulated patients in both experiments.
Domain assumption: EDE-Q scores from the five published case vignettes constitute valid ground-truth for measuring simulation accuracy. Directly used to quantify the 0.7-1.8 point overshoot and selective stereotyping.

Reference graph

Works this paper leans on

53 extracted references · 22 canonical work pages

[1]

Consistently Simulating Human Personas with Multi-Turn Reinforcement Learning, October 2025

Marwa Abdulhai, Ryan Cheng, Donovan Clay, Tim Althoff, Sergey Levine, and Natasha Jaques. Consistently Simulating Human Personas with Multi-Turn Reinforcement Learning, October 2025

2025
[2]

Opler, Pamela Valera, and Eric Jarmon

Sebastian Acevedo, Esha Aneja, Douglas J. Opler, Pamela Valera, and Eric Jarmon. Evaluating the Efficacy of ChatGPT-3.5 Versus Human-Delivered Text-Based Cognitive-Behavioral Therapy: A Comparative Pilot Study.American Journal of Psychotherapy, 79(1):4–11, March 2026. ISSN 0002-9564, 2575-6559. doi: 10.1176/appi.psychotherapy.20240070

work page doi:10.1176/appi.psychotherapy.20240070 2026
[3]

Artificial intelligence in psychiatric education: Enhancing clinical competence through simulation.Industrial Psychiatry Journal, 34(1):11, January 2025

Victor Ajluni. Artificial intelligence in psychiatric education: Enhancing clinical competence through simulation.Industrial Psychiatry Journal, 34(1):11, January 2025. ISSN 0972-6748. doi: 10.4103/ipj.ipj_377_24

work page doi:10.4103/ipj.ipj_377_24 2025
[4]

SaySelf: Teaching LLMs to express confidence with self-reflective rationales

Tilman Beck, Hendrik Schuff, Anne Lauscher, and Iryna Gurevych. Sensitivity, Performance, Robustness: Deconstructing the Effect of Sociodemographic Prompting. In Yvette Graham and Matthew Purver, editors,Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2589–2615, St....

work page doi:10.18653/v1/2024 2024
[5]

Doll, Zafra Cooper, Marianne O’Connor, Robert L

Kristin Bohn, Helen A. Doll, Zafra Cooper, Marianne O’Connor, Robert L. Palmer, and Christo- pher G. Fairburn. The measurement of impairment due to eating disorder psychopathology. Behaviour Research and Therapy, 46(10):1105–1110, 2008

2008
[6]

The threat of analytic flexibility in using large language models to simulate human data, April 2026

Jamie Cummins. The threat of analytic flexibility in using large language models to simulate human data, April 2026

2026
[7]

Uncovering AI’s hidden risks: An empirical analysis of health-related AI incidents and their ethical implications.AI and Ethics, 6(2):169, February 2026

Kerstin Denecke, Octavio Rivera-Romero, Guillermo López-Campos, Enrique Dorronzoro, and Elia Gabarron. Uncovering AI’s hidden risks: An empirical analysis of health-related AI incidents and their ethical implications.AI and Ethics, 6(2):169, February 2026. ISSN 2730-5961. doi: 10.1007/s43681-026-01012-7

work page doi:10.1007/s43681-026-01012-7 2026
[8]

Elif Ergüney-Okumuş

F. Elif Ergüney-Okumuş. Integrating EMDR With Enhanced Cognitive Behavioral Therapy in the Treatment of Bulimia Nervosa: A Single Case Study.Journal of EMDR Practice and Research, 15(4):231–243, May 2023. doi: 10.1891/EMDR-D-21-00012

work page doi:10.1891/emdr-d-21-00012 2023
[9]

Fairburn and Sarah J

Christopher G. Fairburn and Sarah J. Beglin. Eating disorder examination questionnaire. In Cognitive Behavior Therapy and Eating Disorders, pages 309–313. Guilford Press, 2008

2008
[10]

transdiagnostic

Christopher G. Fairburn, Zafra Cooper, and Roz Shafran. Cognitive behaviour therapy for eating disorders: A “transdiagnostic” theory and treatment.Behaviour Research and Therapy, 41(5):509–528, 2003. doi: 10.1016/S0005-7967(02)00088-8

work page doi:10.1016/s0005-7967(02)00088-8 2003
[12]

Garner, Marion P

David M. Garner, Marion P. Olmsted, Yvonne Bohr, and Paul E. Garfinkel. The eating attitudes test: Psychometric features and clinical correlates.Psychological Medicine, 12(4):871–878, 1982. 12

1982
[13]

Stable Personas: Dual-Assessment of Temporal Stability in LLM-Based Human Simulation, January 2026

Jana Gonnermann-Müller, Jennifer Haase, Nicolas Leins, Thomas Kosch, and Sebastian Pokutta. Stable Personas: Dual-Assessment of Temporal Stability in LLM-Based Human Simulation, January 2026

2026
[15]

Designing LLM-Agents with Personalities: A Psychometric Approach, October 2024

Muhua Huang, Xijuan Zhang, Christopher Soto, and James Evans. Designing LLM-Agents with Personalities: A Psychometric Approach, October 2024

2024
[16]

When AI takes the couch: Psychometric jailbreaks reveal internal conflict in frontier models, December 2025

Afshin Khadangi, Hanna Marxen, Amir Sartipi, Igor Tchappi, and Gilbert Fridgen. When AI takes the couch: Psychometric jailbreaks reveal internal conflict in frontier models, December 2025

2025
[17]

Quantifying AI Psychology: A Psychometrics Benchmark for Large Language Models, June 2024

Yuan Li, Yue Huang, Hongyi Wang, Xiangliang Zhang, James Zou, and Lichao Sun. Quantifying AI Psychology: A Psychometrics Benchmark for Large Language Models, June 2024

2024
[18]

TRUTH DECAY: Quantifying Multi-Turn Sycophancy in Language Models, February 2025

Joshua Liu, Aarav Jain, Soham Takuri, Srihan Vege, Aslihan Akalin, Kevin Zhu, Sean O’Brien, and Vasu Sharma. TRUTH DECAY: Quantifying Multi-Turn Sycophancy in Language Models, February 2025

2025
[19]

Leveraging Large Language Models for Simulated Psychotherapy Client Interactions: Develop- ment and Usability Study of Client101.JMIR Medical Education, 11(1):e68056, July 2025

Daniel Cabrera Lozoya, Mike Conway, Edoardo Sebastiano De Duro, and Simon D’Alfonso. Leveraging Large Language Models for Simulated Psychotherapy Client Interactions: Develop- ment and Usability Study of Client101.JMIR Medical Education, 11(1):e68056, July 2025. doi: 10.2196/68056

work page doi:10.2196/68056 2025
[20]

Luce and Janis H

Kristine H. Luce and Janis H. Crowther. The reliability of the Eating Disorder Examination— self-report questionnaire version (EDE-Q).International Journal of Eating Disorders, 25(3): 349–351, 1999. doi: 10.1002/(SICI)1098-108X(199904)25:3<349::AID-EAT15>3.0.CO;2-M

work page doi:10.1002/(sici)1098-108x(199904)25:3 1999
[21]

Principled Per- sonas: Defining and Measuring the Intended Effects of Persona Prompting on Task Performance

Pedro Henrique Luz de Araujo, Paul Röttger, Dirk Hovy, and Benjamin Roth. Principled Per- sonas: Defining and Measuring the Intended Effects of Persona Prompting on Task Performance. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Proces...

work page doi:10.18653/v1/2025.emnlp-main.1364 2025
[22]

Manning, Kehang Zhu, and John J

Benjamin S. Manning, Kehang Zhu, and John J. Horton. Automated Social Science: Language Models as Scientist and Subjects, April 2024

2024
[23]

Ng, Makeda Moore, Isabelle Felix, and Chad E

Akihiko Masuda, Stacey Y. Ng, Makeda Moore, Isabelle Felix, and Chad E. Drake. Acceptance and commitment therapy as a treatment for a Latina young adult woman with purging: A case report.Practice Innovations, 1(1):20–35, 2016. ISSN 2377-8903. doi: 10.1037/pri0000012

work page doi:10.1037/pri0000012 2016
[24]

Bernou Melisse and Teresa Arora. Cognitive behavioral therapy-enhanced through video- conferencing for night eating syndrome, binge-eating disorder and comorbid insomnia: A Case Report.Journal of Eating Disorders, 12(1):175, November 2024. ISSN 2050-2974. doi: 10.1186/s40337-024-01131-8. 13

work page doi:10.1186/s40337-024-01131-8 2024
[25]

Psychologically-Valid Generative Agents: A Novel Approach to Agent-Based Modeling in Social Sciences.Proceedings of the AAAI Symposium Series, 2(1):340–348, 2023

Konstantinos Mitsopoulos, Ritwik Bose, Brodie Mather, Archna Bhatia, Kevin Gluck, Bonnie Dorr, Christian Lebiere, and Peter Pirolli. Psychologically-Valid Generative Agents: A Novel Approach to Agent-Based Modeling in Social Sciences.Proceedings of the AAAI Symposium Series, 2(1):340–348, 2023. ISSN 2994-4317. doi: 10.1609/aaaiss.v2i1.27698

work page doi:10.1609/aaaiss.v2i1.27698 2023
[26]

Trustworthy AI Psychotherapy: Multi-Agent LLM Workflow for Counseling and Explainable Mental Disorder Diagnosis, August 2025

Mithat Can Ozgun, Jiahuan Pei, Koen Hindriks, Lucia Donatelli, Qingzhi Liu, and Junxiao Wang. Trustworthy AI Psychotherapy: Multi-Agent LLM Workflow for Counseling and Explainable Mental Disorder Diagnosis, August 2025

2025
[27]

Cunningham, Joel Z

Davide Paglieri, Logan Cross, William A. Cunningham, Joel Z. Leibo, and Alexander Sasha Vezhnevets. Persona Generators: Generating Diverse Synthetic Personas at Scale, February 2026

2026
[28]

, Bowman, S R

Arjun Panickssery, Samuel R. Bowman, and Shi Feng. LLM Evaluators Recognize and Favor Their Own Generations.Advances in Neural Information Processing Systems, 37:68772–68802, December 2024. doi: 10.52202/079017-2197

work page doi:10.52202/079017-2197 2024
[29]

Bernstein

Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative Agents: Interactive Simulacra of Human Behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, UIST ’23, pages 1–22, New York, NY, USA, October 2023. Association for Computing Machinery. ISB...

work page doi:10.1145/3586183.3606763 2023
[30]

Zou, Aaron Shaw, Benjamin Mako Hill, Carrie Cai, Meredith Ringel Morris, Robb Willer, Percy Liang, and Michael S

Joon Sung Park, Carolyn Q. Zou, Aaron Shaw, Benjamin Mako Hill, Carrie Cai, Meredith Ringel Morris, Robb Willer, Percy Liang, and Michael S. Bernstein. Generative Agent Simulations of 1,000 People, November 2024

2024
[31]

In-Context Impersonation Reveals Large Language Models’ Strengths and Biases.Advances in Neural Information Processing Systems, 36:72044–72057, December 2023

Leonard Salewski, Stephan Alaniz, Isabel Rio-Torto, Eric Schulz, and Zeynep Akata. In-Context Impersonation Reveals Large Language Models’ Strengths and Biases.Advances in Neural Information Processing Systems, 36:72044–72057, December 2023

2023
[32]

Personagym: Evaluating persona agents and LLMs

Vinay Samuel, Henry Peng Zou, Yue Zhou, Shreyas Chaudhari, Ashwin Kalyan, Tanmay Rajpurohit, Ameet Deshpande, Karthik Narasimhan, and Vishvak Murahari. Personagym: Evaluating persona agents and LLMs. InFindings of the Association for Computational Linguistics: EMNLP 2025, pages 6999–7022, Suzhou, China, November 2025. Association for Computational Linguis...

work page doi:10.18653/v1/2025.findings-emnlp.368 2025
[33]

Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting

Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting. InThe Twelfth International Conference on Learning Representations, 2024. URLhttps://openreview.net/forum?id=RIu5lyNXjT

2024
[34]

Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R

Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, and Ethan Perez. Towards Understanding Sycophancy in Language Models. InThe Twelf...

2024
[35]

Shrout and Joseph L

Patrick E. Shrout and Joseph L. Fleiss. Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2):420–428, 1979. doi: 10.1037/0033-2909.86.2.420. 14

work page doi:10.1037/0033-2909.86.2.420 1979
[36]

Jansen, and Jang Hyun Kim

Seungjong Sun, Eungu Lee, Dongyan Nan, Xiangying Zhao, Wonbyung Lee, Bernard J. Jansen, and Jang Hyun Kim. Random Silicon Sampling: Simulating Human Sub-Population Opinion Using a Large Language Model Based on Group-Level Demographic Information, February 2024

2024
[37]

Noah Wang, Z.y. Peng, Haoran Que, Jiaheng Liu, Wangchunshu Zhou, Yuhan Wu, Hongcheng Guo, Ruitong Gan, Zehao Ni, Jian Yang, Man Zhang, Zhaoxiang Zhang, Wanli Ouyang, Ke Xu, Wenhao Huang, Jie Fu, and Junran Peng. RoleLLM: Benchmarking, Eliciting, and Enhancing Role-Playing Abilities of Large Language Models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar...

work page doi:10.18653/v1/2024.findings-acl.878 2024
[38]

Incharacter: Evaluating personality fidelity in role-playing agents through psychological interviews

Xintao Wang, Yunze Xiao, Jen-tse Huang, Siyu Yuan, Rui Xu, Haoran Guo, Quan Tu, Yaying Fei, Ziang Leng, and Wei Wang. Incharacter: Evaluating personality fidelity in role-playing agents through psychological interviews. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1840–1873, Bangk...

work page doi:10.18653/v1/2024.acl-long.102 2024
[39]

Toward an Evaluation Science for Generative AI Systems, March 2025

Laura Weidinger, Inioluwa Deborah Raji, Hanna Wallach, Margaret Mitchell, Angelina Wang, Olawale Salaudeen, Rishi Bommasani, Deep Ganguli, Sanmi Koyejo, and William Isaac. Toward an Evaluation Science for Generative AI Systems, March 2025

2025
[40]

From Symptoms to Systems: An Expert-Guided Approach to Understanding Risks of Generative AI for Eating Disorders, December 2025

Amy Winecoff and Kevin Klyman. From Symptoms to Systems: An Expert-Guided Approach to Understanding Risks of Generative AI for Eating Disorders, December 2025

[41] Shihao Xu, Tiancheng Zhou, Jiatong Ma, Yanli Ding, Yiming Yan, Ming Xiao, Guoyi Li, Haiyang Geng, Yunyun Han, Jianhua Chen, and Yafeng Deng. LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis, February 2026.

[42] Leo Yeykelis, Kaavya Pichai, James J. Cummings, and Byron Reeves. Using Large Language Models to Create AI Personas for Replication and Prediction of Media Effects: An Empirical Test of 133 Published Experimental Research Findings, August 2024.

[43] Soyoung Yun, Minjoo Kim, Yeohyang Kim, and Hayoung Oh. Persona Drift for Adaptive Flow in Stage-Aware CBT Chatbots. In Companion of the 2025 ACM International Joint Conference on Pervasive and Ubiquitous Computing, UbiComp Companion ’25, pages 1645–1651, New York, NY, USA, December 2025. Association for Computing Machinery. ISBN 979-8-4007-1477-1. doi: 10....

[44] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595–46623, December 2023.

[45] Mingqian Zheng, Jiaxin Pei, Lajanugen Logeswaran, Moontae Lee, and David Jurgens. When “A Helpful Assistant” Is Not Really Helpful: Personas in System Prompts Do Not Improve Performances of Large Language Models. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, pages 15126...

“Imagine it’s a Friday night and your fridge has pizzas and leftover cake from a party. Take me step-by-step through your thought process, what you choose to do, and how you feel physically and emotionally the next morning.”

“Got it. In that Friday-night moment, what are the main urges or rules you notice (for example, ‘I should/shouldn’t’ thoughts), and how do they affect what you actually do? Describe your emotions, motivation, and any coping strategies you try.”

“And the next morning—how do you interpret what happened the night before? What do you tell yourself, how does your mood shift, and does it change what you do with food or your plans that day?”

Block B — Corporate lunch

“What about corporate lunch with your colleagues? What’s your experience like in that setting—your thoughts, emotions, motivation, and behavior in the moment?”

“In those kinds of social-meal situations, what feels most noticeable internally (e.g., being observed, comparisons, pressure to seem ‘normal’)? Walk me through your inner dialogue and how it shapes what you choose to eat or avoid.”

“After the corporate lunch is over, what tends to linger for you (physically, emotionally, mentally), and what do you do afterward? Describe any coping strategies and how your feelings change over the rest of the day.”

Block C — Food intake

“Let’s talk about your food intake. Over the past week, what has your average daily food intake looked like? Please describe your thinking, emotions, motivation, and behavior as you go through a typical day—include any inner dialogue, emotional shifts, and coping strategies.”

“Thanks—that’s helpful. When during the day does eating (or not eating) feel easiest vs hardest, and what’s usually going through your mind in those moments? Describe your thoughts, emotions, motivation, and what you tend to do next.”

“Are there any patterns across the week (workdays vs weekend, stress vs calm days) that change how you eat or how you feel about eating? Walk me through it with your inner dialogue and emotional shifts.”

A.6 Assessment prompts

All assessment prompts instruct the model to return structured JSON. Because no observer-report versions of the EDE-Q, CIA, or EAT...
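
Because the assessment prompts ask for structured JSON, each reply can be checked before it is scored. The following is a minimal sketch of such a check and not the authors' pipeline: the `items` key, the item identifiers, and the function names `parse_assessment` and `mean_score` are hypothetical, and only the 0-6 item range is taken from the paper. The unweighted mean is a placeholder rather than the official EDE-Q subscale scoring.

```python
import json

SCALE_MIN, SCALE_MAX = 0, 6  # EDE-Q item range as described in the paper


def parse_assessment(raw: str) -> dict[str, int]:
    """Parse a model reply and return validated item scores.

    Assumes a hypothetical schema: {"items": {"<item id>": <int 0-6>, ...}}.
    """
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    items = data.get("items") if isinstance(data, dict) else None
    if not isinstance(items, dict) or not items:
        raise ValueError("expected a non-empty 'items' object")
    for key, value in items.items():
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"item {key!r}: score must be an integer")
        if not SCALE_MIN <= value <= SCALE_MAX:
            raise ValueError(f"item {key!r}: score {value} outside 0-6")
    return items


def mean_score(items: dict[str, int]) -> float:
    """Unweighted item mean; a placeholder, not the official subscale scoring."""
    return sum(items.values()) / len(items)


reply = '{"items": {"q1": 4, "q2": 6, "q3": 2}}'  # illustrative reply, not real data
scores = parse_assessment(reply)
print(mean_score(scores))  # 4.0
```

Rejecting out-of-range or non-integer values at parse time keeps a malformed reply from silently shifting a severity estimate. Whether to retry or discard such a reply is a design choice this sketch leaves open.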

ISSN 1942-969X. doi: 10.1037/tra0001675

PD Akihiko Masuda, Stacey Y. Ng, Makeda Moore, Isabelle Felix, and Chad E. Drake. Acceptance and commitment therapy as a treatment for a Latina young adult woman with purging: A case report. Practice Innovations, 1(1):20–35, 2016. ISSN 2377-8903. doi: 10.1037/pri0000012

A.9.1 Scripted conversation partner questions

S...
