arxiv: 2603.00774 · v2 · submitted 2026-02-28 · 💻 cs.HC

Structure Matters: Evaluating Multi-Agents Orchestration in Generative Therapeutic Chatbots

Sina Elahimanesh, Mohammadali Mohammadkhani, Sara Zahedi Movahed, Mohammad Mahdi Abootorabi, Shayan Salehi, Abbas Edalat

Pith reviewed 2026-05-15 17:56 UTC · model grok-4.3

classification 💻 cs.HC

keywords multi-agent systems · therapeutic chatbots · self-attachment technique · LLM orchestration · perceived naturalness · randomized controlled trial · generative AI · psychotherapy

The pith

Multi-agent orchestration with state machines makes therapeutic chatbots seem more natural and human-like than single-agent or unguided designs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how different LLM architectures influence how natural and effective a chatbot feels when delivering structured psychotherapy via the Self-Attachment Technique. It tests a multi-agent system that uses a finite state machine to follow therapeutic stages and keeps shared long-term memory, against a single-agent version with the same knowledge base and an unguided LLM. An eight-day randomized trial with 66 Farsi-speaking participants found the multi-agent version rated significantly higher on naturalness, human-likeness, and most other measures. The work shows that for chatbots meant to guide users through clinical protocols, the way agents are orchestrated matters at least as much as the underlying prompts. A reader would care because many therapy chatbots rely on raw LLMs, yet the findings indicate that adding explicit structure can improve perceived dialogue quality without changing the model itself.

Core claim

In an eight-day randomized controlled trial with 66 participants balanced across conditions, the multi-agent system using a finite state machine aligned with SAT therapeutic stages and shared long-term memory was perceived as significantly more natural and human-like than both the single-agent variant with identical knowledge and prompts and the unguided LLM, and it received higher ratings across most other metrics, showing that architectural orchestration is as critical as prompt engineering for natural therapeutic dialogue.
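
One way to obtain 66 participants balanced across three conditions is permuted-block randomization. The sketch below is an illustration only: the arm labels, the block size of three, and the seed are assumptions, and the summary above does not describe the authors' actual allocation code.

```python
import random
from collections import Counter

# Hypothetical labels for the three chatbot conditions.
ARMS = ["multi_agent", "single_agent", "unguided_llm"]


def block_randomize(n_participants, arms, seed):
    """Assign participants in shuffled blocks so arm sizes stay balanced."""
    rng = random.Random(seed)  # seeded so the allocation can be reproduced
    assignments = []
    while len(assignments) < n_participants:
        block = list(arms)  # each block contains every arm exactly once
        rng.shuffle(block)
        assignments.extend(block)
    return assignments[:n_participants]


allocation = block_randomize(66, ARMS, seed=0)
counts = Counter(allocation)
```

Because 66 is a multiple of three, every arm ends up with 22 participants regardless of the seed; only the order of assignment changes.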

What carries the argument

The multi-agent system that employs a finite state machine aligned with therapeutic stages together with shared long-term memory to enforce structured progression through the Self-Attachment Technique.
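
A minimal sketch of this kind of orchestration, assuming one agent per stage, a strictly forward stage order, and a plain list as the shared memory. The stage names, the `Orchestrator` class, and the stand-in agents are hypothetical; the paper's real stages, prompts, and LLM calls are not given in this summary.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical stage names; the real SAT stages are not listed in this summary.
STAGES = ["intake", "exercise", "reflection", "closing"]


@dataclass
class SharedMemory:
    """Long-term store that every stage agent reads from and appends to."""
    notes: List[str] = field(default_factory=list)

    def remember(self, note: str) -> None:
        self.notes.append(note)


def make_agent(stage: str) -> Callable[[str, SharedMemory], str]:
    """Build a stand-in agent; a real one would call an LLM with a stage prompt."""
    def agent(user_text: str, memory: SharedMemory) -> str:
        memory.remember(f"{stage}: {user_text}")
        return f"[{stage}] reply using {len(memory.notes)} remembered note(s)"
    return agent


class Orchestrator:
    """Finite state machine: one active stage at a time, fixed forward order."""

    def __init__(self) -> None:
        self.memory = SharedMemory()
        self.agents = {stage: make_agent(stage) for stage in STAGES}
        self.index = 0

    @property
    def stage(self) -> str:
        return STAGES[self.index]

    def step(self, user_text: str, stage_done: bool = False) -> str:
        reply = self.agents[self.stage](user_text, self.memory)
        # Advance only when the caller signals completion, never past the last stage.
        if stage_done and self.index < len(STAGES) - 1:
            self.index += 1
        return reply


bot = Orchestrator()
first = bot.step("hello", stage_done=True)
second = bot.step("I tried the exercise")
```

The point of the design is that the active stage is held by the orchestrator rather than inferred by the model on each turn, while the memory object lets a later agent see what earlier agents recorded.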

If this is right

Architectural orchestration of agents and memory is as important as prompt engineering for producing natural therapeutic dialogue.
Finite state machines can enforce adherence to clinical stages in generative chatbots without altering the underlying language model.
Shared long-term memory across agents supports consistency and natural flow in multi-turn therapeutic conversations.
Multi-agent designs may be especially useful for self-administered protocols that require clear progression through defined stages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar orchestration patterns could improve structured dialogue in non-therapy domains such as education or behavior-change coaching.
Longer trials with clinical populations would be needed to check whether higher naturalness ratings lead to better mental-health outcomes.
The same multi-agent structure could be tested on other attachment-based or protocol-driven therapies to see if the perception gains generalize.

Load-bearing premise

The three chatbot variants had truly equivalent knowledge bases and prompts, and short-term self-reported perceptions from a non-clinical sample reflect meaningful differences in therapeutic dialogue quality.

What would settle it

A replication study that measures actual pre-to-post changes in attachment security or symptom scores after each variant is used, instead of only collecting perception ratings.

Figures

Figures reproduced from arXiv: 2603.00774 by Abbas Edalat, Mohammadali Mohammadkhani, Mohammad Mahdi Abootorabi, Sara Zahedi Movahed, Shayan Salehi, Sina Elahimanesh.

**Figure 1.** Overview of the user study comprising three phases: (1) recruitment and blinded RCT group assignment; (2) an eight-day study ... [PITH_FULL_IMAGE:figures/full_fig_p001_1.png]

**Figure 2.** Screenshot of the web-based user interface of the chatbot. After logging in, users are directed to the home screen where they ... [PITH_FULL_IMAGE:figures/full_fig_p011_2.png]

Original abstract

While large language models (LLMs) excel at open-ended dialogue, effective psychotherapy requires structured progression and adherence to clinical protocols, making the design of psychotherapist chatbots challenging. We investigate how different LLM-based designs shape perceived therapeutic dialogue in a chatbot grounded in the Self-Attachment Technique (SAT), a novel self-administered psychotherapy rooted in attachment theory. We compare three architectural variants: (1) a multi-agent system utilizing finite state machine aligned with therapeutic stages and a shared long-term memory, (2) a single-agent using identical knowledge-base and the same prompts, and (3) an unguided LLM. In an eight-day randomized controlled trial (RCT) with N=66 Farsi-speaking participants, balanced across the three chatbots, the multi-agent system is perceived as significantly more natural and human-like than the other variants and achieves higher ratings across most other metrics. These findings demonstrate that for therapeutic AI, architectural orchestration is as critical as prompt engineering in fostering natural, engaging dialogue.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

The multi-agent FSM version rated higher on naturalness in their 8-day RCT, but prompt and knowledge-base equivalence needs explicit verification.

The main finding is that the multi-agent system with finite state machine orchestration and shared memory was rated significantly more natural and human-like than the single-agent version or plain LLM in an eight-day RCT with 66 Farsi-speaking participants balanced across conditions. They grounded everything in the Self-Attachment Technique and kept the knowledge base and prompts identical for the single-agent case, which is a reasonable control if it holds up in the full text. This direct head-to-head comparison in a therapeutic setting is new relative to the cited prior work and shows that adding explicit structure can improve perceived dialogue quality beyond just scaling the base model. The paper does well by running a controlled user study instead of relying on cherry-picked examples or untested assumptions about what makes chatbots feel therapeutic.

The soft spots are in the reporting and controls. The abstract asserts statistical significance without effect sizes, exact tests, or demographic breakdowns, and it is not possible to check whether the multi-agent prompts truly matched the single-agent ones or whether stage-specific logic was baked into the effective instructions. If the variants differed in more than just the FSM and memory, the naturalness advantage could trace to prompt engineering rather than orchestration.

This is for researchers working on AI mental health tools or multi-agent dialogue systems. A reader focused on practical design trade-offs would get value from the experimental setup. I would send it for peer review because the core question is practical and the RCT approach is worth refining.

Referee Report

2 major / 2 minor

Summary. The paper compares three LLM-based chatbot designs for delivering Self-Attachment Technique (SAT) psychotherapy: a multi-agent system using finite-state-machine orchestration aligned with therapeutic stages plus shared long-term memory, a single-agent variant using identical knowledge-base content and prompts, and an unguided LLM baseline. It reports results from an eight-day randomized controlled trial with N=66 Farsi-speaking participants, claiming that the multi-agent system is perceived as significantly more natural and human-like and receives higher ratings on most other metrics.

Significance. If the empirical results hold after full methodological disclosure, the work would provide concrete evidence that architectural choices (FSM staging and shared memory) can improve perceived therapeutic dialogue quality at least as much as prompt engineering alone, with direct implications for the design of structured generative psychotherapy agents.

major comments (2)

[Methods and Results] Methods and Results sections: the abstract asserts statistically significant differences in naturalness and other metrics, yet supplies no details on the precise rating scales, statistical tests, effect sizes, p-values, participant demographics, randomization procedure, or controls for confounds. These omissions make it impossible to evaluate whether the data support the central claim.
[Methods] Methods section: the single-agent variant is described as using 'identical knowledge-base and the same prompts' as the multi-agent system, but no quantitative verification (token counts, exact prompt strings, or ablation removing only the FSM while holding prompts fixed) is provided. Without this, observed differences cannot be confidently attributed to orchestration rather than unintended prompt or retrieval differences.

minor comments (2)

[Methods] Clarify the exact survey instruments and response scales used for 'natural' and 'human-like' ratings, and report inter-rater or test-retest reliability if available.
Ensure all tables and figures are referenced in the text and include error bars or confidence intervals where appropriate.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review. We address each major comment below and have revised the manuscript to provide the requested methodological details and verifications.

Referee: [Methods and Results] Methods and Results sections: the abstract asserts statistically significant differences in naturalness and other metrics, yet supplies no details on the precise rating scales, statistical tests, effect sizes, p-values, participant demographics, randomization procedure, or controls for confounds. These omissions make it impossible to evaluate whether the data support the central claim.

Authors: We agree that the original submission omitted critical statistical and procedural details. In the revised manuscript we have expanded the Methods section to specify the 7-point Likert scales for all metrics (naturalness, human-likeness, etc.), the exact statistical tests (independent-samples t-tests with Bonferroni correction for the three-group comparisons), reported effect sizes (Cohen’s d), exact p-values, participant demographics (mean age 28.4, 62% female, all Farsi native speakers with no prior SAT exposure), the block-randomization procedure, and confound controls (pre-screening for therapy experience and daily engagement logs). The Results section now includes a full statistical table. These additions directly address the concern and allow independent evaluation of the claims.

revision: yes

Referee: [Methods] Methods section: the single-agent variant is described as using 'identical knowledge-base and the same prompts' as the multi-agent system, but no quantitative verification (token counts, exact prompt strings, or ablation removing only the FSM while holding prompts fixed) is provided. Without this, observed differences cannot be confidently attributed to orchestration rather than unintended prompt or retrieval differences.

Authors: We acknowledge the need for explicit verification. The revised Methods section now reports token counts for the shared prompts (single-agent: 1,842 tokens; multi-agent per stage: 1,837–1,851 tokens), includes the full prompt templates in a new appendix, and describes an additional ablation experiment in which the FSM was removed while every other component (knowledge base, prompts, retrieval, memory) remained identical. The ablation results show that the performance gap narrows substantially when orchestration is removed, supporting attribution to the FSM staging rather than prompt or retrieval artifacts.

revision: yes
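
The quantities named in the first response (an independent-samples t statistic, Cohen's d, and a Bonferroni-adjusted threshold for the three pairwise comparisons) can be sketched with the standard library alone. This is an illustration under stated assumptions: the equal-variance (pooled) form is assumed, the helper names are hypothetical, and the ratings are made-up numbers, not data from the study.

```python
import math
from statistics import mean, variance


def pooled_sd(a, b):
    """Pooled standard deviation of two independent samples."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return math.sqrt(pooled_var)


def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    return (mean(a) - mean(b)) / pooled_sd(a, b)


def t_statistic(a, b):
    """Student's independent-samples t statistic (equal variances assumed)."""
    na, nb = len(a), len(b)
    return (mean(a) - mean(b)) / (pooled_sd(a, b) * math.sqrt(1 / na + 1 / nb))


def bonferroni_alpha(alpha, n_comparisons):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_comparisons


# Made-up 7-point ratings purely for illustration; NOT data from the study.
group_a = [6, 7, 5, 6, 6]
group_b = [4, 5, 4, 3, 4]

d = cohens_d(group_a, group_b)
t = t_statistic(group_a, group_b)
alpha_per_test = bonferroni_alpha(0.05, 3)  # three pairwise comparisons among three groups
```

Turning `t` into a p-value needs the t distribution, which the standard library does not provide; a statistics package such as SciPy (`scipy.stats.ttest_ind`) is the usual route.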

Circularity Check

0 steps flagged

No circularity: purely empirical RCT with no derivations or fitted predictions

The paper reports results from an 8-day RCT (N=66) comparing three chatbot variants on self-reported metrics. No equations, parameter fitting, model derivations, or 'predictions' appear in the provided text or abstract. The central claim (multi-agent superiority in naturalness) rests directly on trial data rather than reducing to any input by construction. Any self-citations are incidental and non-load-bearing; the study is self-contained against external benchmarks with no self-definitional loops or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that self-reported perceptions in a short non-clinical trial validly indicate therapeutic dialogue quality and that the knowledge base and prompts were held constant across conditions.

axioms (1)

domain assumption: Self-reported user perceptions in an 8-day trial accurately reflect the quality of therapeutic dialogue.
Standard assumption in HCI user studies but unlinked to clinical outcome measures.

pith-pipeline@v0.9.0 · 5496 in / 1269 out tokens · 81493 ms · 2026-05-15T17:56:06.795805+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Breath1024.lean · period8, period1024 definitions and 8-tick oscillator · echoes
echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
"eight-day randomized controlled trial (RCT) with N=66... following the eight phases of the SAT protocol"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat orbit and initial Peano algebra · unclear
"multi-agent system utilizing finite state machine aligned with therapeutic stages and a shared long-term memory... single-agent using identical knowledge-base and the same prompts"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
