Pith · machine review for the scientific record

arxiv: 2605.13318 · v1 · submitted 2026-05-13 · 💻 cs.AI · cs.ET

Recognition: unknown

VERA-MH: Validation of Ethical and Responsible AI in Mental Health

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:11 UTC · model grok-4.3

classification 💻 cs.AI cs.ET
keywords AI safety evaluation · mental health chatbots · suicidal ideation · LLM-as-a-Judge · clinical rubric · conversation simulation · responsible AI · crisis response testing

The pith

VERA-MH provides a three-step method to test whether chatbots respond safely to users expressing suicidal thoughts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VERA-MH as a clinically guided evaluation framework for assessing the safety of AI chatbots in mental health conversations. It centers on suicidal ideation risks by generating simulated dialogues with user personas that incorporate varied risk factors, demographics, and disclosure styles developed under clinical oversight. A separate model then judges each exchange using a step-by-step yes/no clinical rubric to flag specific failure modes consistently. Results from multiple conversations are aggregated into an overall safety rating for the chatbot under test. The authors demonstrate the approach on four major LLM providers to show how it can surface differences in crisis handling.

Core claim

VERA-MH evaluates chatbot safety through conversation simulation using clinically developed user personas, followed by LLM-as-a-Judge assessment with a flow-structured yes/no rubric that checks responses sequentially, and final aggregation of judgments into model ratings. This process focuses on detecting unsafe replies when users show signs of suicidal ideation and was applied to produce comparative results across leading providers.
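
Read procedurally, the core claim maps onto a three-stage pipeline. The following is a minimal Python sketch of that shape; every name here (the Persona fields, the callables standing in for the user model, chatbot, and judge, and the pass-rate aggregation) is an illustrative assumption rather than the authors' released code.

    # Sketch of the three VERA-MH stages as described; all names and the
    # pass-rate aggregation are illustrative assumptions, not the paper's code.
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Persona:
        risk_level: str        # e.g. "passive ideation" vs. "active ideation with a plan"
        demographics: str      # clinically guided demographic attributes
        disclosure_style: str  # how readily the simulated user reveals distress

    def simulate_conversation(persona: Persona,
                              user_turn: Callable[[Persona, list[str]], str],
                              bot_turn: Callable[[list[str]], str],
                              max_turns: int = 10) -> list[str]:
        """Step 1: a user model role-plays the persona against the chatbot under test."""
        transcript: list[str] = []
        for _ in range(max_turns):
            transcript.append("USER: " + user_turn(persona, transcript))
            transcript.append("BOT: " + bot_turn(transcript))
        return transcript

    def judge_conversation(transcript: list[str],
                           ask_yes_no: Callable[[str, list[str]], bool],
                           rubric: list[str]) -> dict[str, bool]:
        """Step 2: an LLM-as-a-Judge answers one yes/no rubric question at a time."""
        return {question: ask_yes_no(question, transcript) for question in rubric}

    def rate_model(judgments: list[dict[str, bool]]) -> float:
        """Step 3: aggregate per-conversation judgments into one safety rating.
        A simple pass rate is shown; the paper does not state its aggregation formula."""
        passes = [all(j.values()) for j in judgments]
        return sum(passes) / len(passes)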

What carries the argument

The clinically-developed yes/no rubric structured as a sequential flow in the LLM-as-a-Judge step, which enforces consistency and isolates specific safety failure modes during crisis simulations.
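
To make the flow-structured rubric concrete, here is a sketch of a yes/no rubric encoded as a small decision flow, where each answer either routes to the next question or terminates with a verdict and a named failure mode. The question wording and failure labels are invented for illustration; the abstract does not reproduce the actual rubric.

    # Illustrative flow-structured yes/no rubric: one question per node, with
    # each answer routing to the next node or to a terminal (verdict, failure).
    # Question wording and failure modes are invented, not the paper's rubric.
    RUBRIC_FLOW = {
        "detects_risk": {
            "question": "Did the chatbot recognize signs of suicidal ideation?",
            "yes": "assesses_severity",
            "no": ("FAIL", "missed risk signal"),
        },
        "assesses_severity": {
            "question": "Did the chatbot ask about severity (plan, means, intent)?",
            "yes": "provides_resources",
            "no": ("FAIL", "no severity assessment"),
        },
        "provides_resources": {
            "question": "Did the chatbot direct the user to crisis resources?",
            "yes": ("PASS", None),
            "no": ("FAIL", "no crisis referral"),
        },
    }

    def run_rubric(answer_yes_no, start="detects_risk"):
        """Walk the flow one yes/no question at a time; answer_yes_no is the judge."""
        node = start
        while True:
            step = RUBRIC_FLOW[node]
            nxt = step["yes"] if answer_yes_no(step["question"]) else step["no"]
            if isinstance(nxt, tuple):  # terminal node: (verdict, failure mode)
                return nxt
            node = nxt

    # Example with a stub judge that answers "yes" to everything:
    print(run_rubric(lambda question: True))  # ("PASS", None)

Each terminal node pairs a verdict with a specific failure mode, which is exactly the isolation property this section credits the rubric with.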

If this is right

  • Developers can run VERA-MH to locate and correct specific unsafe response patterns in mental health chatbots before deployment.
  • The framework supplies comparable safety scores across different models for the same set of crisis scenarios.
  • The aggregation step produces an overall rating that accounts for performance across diverse user personas and risk levels (a stratified-aggregation sketch follows this list).
  • Later iterations of the method could add evaluation for other mental health risks such as self-harm or severe distress.
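
Expanding on the aggregation bullet above, here is a minimal sketch of stratified aggregation, assuming each simulated conversation yields a pass/fail judgment tagged with its persona's risk level. The grouping key and the unweighted mean over risk levels are illustrative choices, not the paper's stated formula.

    # Stratified aggregation sketch (assumed, not the paper's formula): group
    # per-conversation pass/fail results by the persona's risk level, then take
    # an unweighted mean over levels so rarer high-risk personas are not
    # swamped by more common low-risk ones.
    from collections import defaultdict

    def stratified_rating(results: list[dict]) -> dict:
        """results: one {"risk_level": str, "passed": bool} entry per conversation."""
        by_level: dict[str, list[bool]] = defaultdict(list)
        for r in results:
            by_level[r["risk_level"]].append(r["passed"])
        per_level = {level: sum(v) / len(v) for level, v in by_level.items()}
        overall = sum(per_level.values()) / len(per_level)
        return {"per_level": per_level, "overall": overall}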

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the rubric and personas hold up under broader testing, VERA-MH could serve as a reusable template for safety checks in other high-stakes AI applications like medical diagnosis support.
  • The simulation-plus-judge structure might reveal failure modes that simpler prompt-based tests miss, prompting more targeted fine-tuning of models.
  • Widespread adoption would create pressure for public benchmarks that track progress in reducing risky outputs over successive model releases.

Load-bearing premise

The personas created with clinical input and the LLM judge applying the yes/no rubric together accurately reflect real-world crisis interactions and human clinical decisions without introducing new biases or missing important failure modes.

What would settle it

A direct comparison of VERA-MH scores against independent ratings from human mental health clinicians on the identical set of simulated conversations; large systematic disagreements would show the framework does not match clinical judgment.
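
If such a comparison were run, one standard way to score it would be a chance-corrected agreement statistic, such as Cohen's kappa, between the LLM judge's pass/fail verdicts and clinicians' verdicts on the same transcripts. The metric choice and the toy data below are assumptions for illustration; neither the paper nor this review prescribes them.

    # Cohen's kappa between LLM-judge and clinician verdicts on identical
    # transcripts (illustrative metric; not specified by the paper).
    def cohens_kappa(judge: list[bool], clinician: list[bool]) -> float:
        n = len(judge)
        observed = sum(j == c for j, c in zip(judge, clinician)) / n
        p_judge, p_clin = sum(judge) / n, sum(clinician) / n
        expected = p_judge * p_clin + (1 - p_judge) * (1 - p_clin)
        if expected == 1.0:  # degenerate case: both raters give a single label
            return 1.0
        return (observed - expected) / (1 - expected)

    # Hypothetical verdicts on ten simulated conversations (invented data).
    llm_verdicts       = [True, True, False, True, False, True, True, False, True, True]
    clinician_verdicts = [True, True, False, False, False, True, True, True, True, True]
    print(round(cohens_kappa(llm_verdicts, clinician_verdicts), 2))  # 0.52, moderate agreement

A kappa near zero, or disagreement concentrated in the high-risk personas, would be the kind of large systematic divergence that counts against the framework matching clinical judgment.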

Figures

Figures reproduced from arXiv: 2605.13318 by Adam M. Chekroud, Emily Van Ark, Josh Gieringer, Kate H. Bentley, Luca Belli, Matt Hawrilenko, Millard Brown, Nilu Zhao, Pradip Thachile.

Figure 1: Results of the experiments. For each dimension, the Non Relevant column is computed as a …
Figure 2: Results of the experiments focused on Gemini models.
Figure 3: Results of the experiments focused on the GPT5.X family of models.
Figure 4: Results of the experiments focused on Grok models.
Figure 5: Results of the experiments focused on Claude Opus models.
Figure 6: Results of the experiments focused on Claude Sonnet models.
Figure 7: Distribution of the conversational length of both the user and chatbot models. Users' responses …
Original abstract

Chatbot usage has increased, including in fields for which chatbots were never developed, notably mental health support. To that end, we introduce Validation of Ethical and Responsible AI in Mental Health (VERA-MH), a novel clinically-validated evaluation for the safety of chatbots in the context of mental health support. The first iteration of VERA-MH focuses on Suicidal Ideation (SI) risks by assessing how well chatbots respond to users who might be in crisis. VERA-MH comprises three steps: conversation simulation, conversation judging, and model rating. First, to simulate conversations with the chatbot under evaluation, another chatbot is tasked with role-playing users based on specific personas. These user personas were developed under clinical guidance to ensure that, among other attributes, multiple risk factors, demographic characteristics, and disclosure factors were represented. In the judging step, a second support model is used as an LLM-as-a-Judge, together with a clinically-developed rubric. The rubric is structured as a flow, with a single Yes/No question asked each time, to improve answer consistency and highlight models' failure modes. In the last stage, the results of each conversation are aggregated to present the final evaluation of the chatbot. Together with the framework, we present the results of the evaluations for four leading LLM providers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces VERA-MH, a framework for evaluating the safety of chatbots in mental health support with a focus on suicidal ideation (SI) risks. It consists of three steps: (1) conversation simulation using an LLM role-playing clinically-guided user personas that incorporate risk factors, demographics, and disclosure elements; (2) judgment of each conversation by a second LLM-as-a-Judge applying a clinically-developed yes/no rubric structured as a flow to improve consistency; and (3) aggregation of per-conversation results to produce an overall model rating. The paper reports evaluations of four leading LLM providers.

Significance. If the LLM-as-a-Judge outputs are shown to track human clinical judgments, VERA-MH would supply a reproducible, rubric-based benchmark for identifying safety failures in high-stakes conversational AI. The persona-driven simulation and sequential yes/no rubric design are constructive steps toward systematic failure-mode analysis in mental-health contexts.

major comments (2)
  1. [Abstract] Abstract and judging-step description: the claim that VERA-MH is 'clinically-validated' rests on personas and rubric having been 'developed under clinical guidance,' yet no inter-rater reliability study, correlation coefficient, or blind comparison between the LLM judge and licensed clinicians on the same transcripts is reported. This link is load-bearing for the central safety-assessment claim.
  2. [Results] Results section: the manuscript states that evaluations for four LLMs are presented, but supplies no quantitative metrics (accuracy, false-negative rate on SI risk, inter-judge agreement, or demographic-bias analysis), error bars, or comparison to human-clinician baselines, preventing verification that the framework supports the asserted safety conclusions.
minor comments (2)
  1. [Method] Provide the exact wording of the yes/no rubric questions and the aggregation formula used to compute the final model rating from per-conversation judgments.
  2. [Conversation simulation] Clarify the number of simulated conversations per model, the distribution of personas across risk levels, and any steps taken to mitigate prompt-injection or persona-consistency failures.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed and constructive review of our manuscript on VERA-MH. We have carefully considered the major comments and will revise the paper to improve accuracy and transparency. Point-by-point responses follow.

point-by-point responses
  1. Referee: [Abstract] Abstract and judging-step description: the claim that VERA-MH is 'clinically-validated' rests on personas and rubric having been 'developed under clinical guidance,' yet no inter-rater reliability study, correlation coefficient, or blind comparison between the LLM judge and licensed clinicians on the same transcripts is reported. This link is load-bearing for the central safety-assessment claim.

    Authors: We agree that 'clinically-validated' overstates the current evidence. The personas and rubric were developed under clinical guidance, but no inter-rater reliability study, correlation analysis, or blind clinician comparison was performed. We will revise the abstract, methods, and related sections to replace 'clinically-validated' with 'clinically-guided' and add an explicit limitations statement noting the absence of these validation metrics while outlining plans for future clinician studies. revision: yes

  2. Referee: [Results] Results section: the manuscript states that evaluations for four LLMs are presented, but supplies no quantitative metrics (accuracy, false-negative rate on SI risk, inter-judge agreement, or demographic-bias analysis), error bars, or comparison to human-clinician baselines, preventing verification that the framework supports the asserted safety conclusions.

    Authors: We acknowledge that the results section currently emphasizes framework application and aggregated safety ratings without the requested quantitative details. The manuscript reports per-model outcomes from the rubric but lacks accuracy metrics, false-negative rates, inter-judge agreement, demographic-bias analysis, error bars, and human baselines. In revision we will expand the results with tables showing SI-risk detection percentages, persona-level breakdowns, and basic agreement statistics where available from our data. We will also add a limitations note on the lack of human-clinician baselines. revision: partial

standing simulated objections not resolved
  • No inter-rater reliability study or direct comparison between the LLM-as-Judge and licensed clinicians on the same transcripts was conducted, so correlation coefficients and related validation metrics cannot be reported without new data collection.

Circularity Check

0 steps flagged

VERA-MH defines a procedural framework with no self-referential reduction

full rationale

The paper introduces VERA-MH via three explicit procedural steps: (1) conversation simulation using personas developed under clinical guidance, (2) LLM-as-a-Judge evaluation with a flow-structured yes/no clinical rubric, and (3) aggregation of results. No equations, fitted parameters, or derivations are present that would make any output quantity equivalent to its inputs by construction. No self-citations are invoked as load-bearing premises, and the 'clinically-validated' descriptor refers to guidance in rubric and persona development rather than a closed loop where validation is defined by the framework's own outputs. The chain is self-contained as a descriptive methodology without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review performed on abstract only; full details on clinical validation process, exact rubric questions, and persona construction are unavailable, so the ledger is necessarily incomplete.

axioms (2)
  • domain assumption Clinical guidance produces personas that faithfully represent real risk factors, demographics, and disclosure patterns
    Stated in the abstract as the basis for user simulation.
  • domain assumption An LLM-as-a-Judge using a yes/no flow rubric produces consistent and clinically meaningful scores
    Central to the judging step described in the abstract.

pith-pipeline@v0.9.0 · 5572 in / 1432 out tokens · 34040 ms · 2026-05-14T19:11:43.610744+00:00 · methodology

discussion (0)

