Structure Matters: Evaluating Multi-Agents Orchestration in Generative Therapeutic Chatbots
Pith reviewed 2026-05-15 17:56 UTC · model grok-4.3
The pith
Multi-agent orchestration with state machines makes therapeutic chatbots seem more natural and human-like than single-agent or unguided designs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In an eight-day randomized controlled trial with 66 participants balanced across conditions, the multi-agent system using a finite state machine aligned with SAT therapeutic stages and shared long-term memory was perceived as significantly more natural and human-like than both the single-agent variant with identical knowledge and prompts and the unguided LLM, and it received higher ratings across most other metrics, showing that architectural orchestration is as critical as prompt engineering for natural therapeutic dialogue.
What carries the argument
The multi-agent system that employs a finite state machine aligned with therapeutic stages together with shared long-term memory to enforce structured progression through the Self-Attachment Technique.
If this is right
- Architectural orchestration of agents and memory is as important as prompt engineering for producing natural therapeutic dialogue.
- Finite state machines can enforce adherence to clinical stages in generative chatbots without altering the underlying language model.
- Shared long-term memory across agents supports consistency and natural flow in multi-turn therapeutic conversations.
- Multi-agent designs may be especially useful for self-administered protocols that require clear progression through defined stages.
Where Pith is reading between the lines
- Similar orchestration patterns could improve structured dialogue in non-therapy domains such as education or behavior-change coaching.
- Longer trials with clinical populations would be needed to check whether higher naturalness ratings lead to better mental-health outcomes.
- The same multi-agent structure could be tested on other attachment-based or protocol-driven therapies to see if the perception gains generalize.
Load-bearing premise
The three chatbot variants had truly equivalent knowledge bases and prompts, and short-term self-reported perceptions from a non-clinical sample reflect meaningful differences in therapeutic dialogue quality.
What would settle it
A replication study that measures actual pre-to-post changes in attachment security or symptom scores after each variant is used, instead of only collecting perception ratings.
Figures
read the original abstract
While large language models (LLMs) excel at open-ended dialogue, effective psychotherapy requires structured progression and adherence to clinical protocols, making the design of psychotherapist chatbots challenging. We investigate how different LLM-based designs shape perceived therapeutic dialogue in a chatbot grounded in the Self-Attachment Technique (SAT), a novel self-administered psychotherapy rooted in attachment theory. We compare three architectural variants: (1) a multi-agent system utilizing finite state machine aligned with therapeutic stages and a shared long-term memory, (2) a single-agent using identical knowledge-base and the same prompts, and (3) an unguided LLM. In an eight-day randomized controlled trial (RCT) with N=66 Farsi-speaking participants, balanced across the three chatbots, the multi-agent system is perceived as significantly more natural and human-like than the other variants and achieves higher ratings across most other metrics. These findings demonstrate that for therapeutic AI, architectural orchestration is as critical as prompt engineering in fostering natural, engaging dialogue.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper compares three LLM-based chatbot designs for delivering Self-Attachment Technique (SAT) psychotherapy: a multi-agent system using finite-state-machine orchestration aligned with therapeutic stages plus shared long-term memory, a single-agent variant using identical knowledge-base content and prompts, and an unguided LLM baseline. It reports results from an eight-day randomized controlled trial with N=66 Farsi-speaking participants, claiming that the multi-agent system is perceived as significantly more natural and human-like and receives higher ratings on most other metrics.
Significance. If the empirical results hold after full methodological disclosure, the work would provide concrete evidence that architectural choices (FSM staging and shared memory) can improve perceived therapeutic dialogue quality at least as much as prompt engineering alone, with direct implications for the design of structured generative psychotherapy agents.
major comments (2)
- [Methods and Results] Methods and Results sections: the abstract asserts statistically significant differences in naturalness and other metrics, yet supplies no details on the precise rating scales, statistical tests, effect sizes, p-values, participant demographics, randomization procedure, or controls for confounds. These omissions make it impossible to evaluate whether the data support the central claim.
- [Methods] Methods section: the single-agent variant is described as using 'identical knowledge-base and the same prompts' as the multi-agent system, but no quantitative verification (token counts, exact prompt strings, or ablation removing only the FSM while holding prompts fixed) is provided. Without this, observed differences cannot be confidently attributed to orchestration rather than unintended prompt or retrieval differences.
minor comments (2)
- [Methods] Clarify the exact survey instruments and response scales used for 'natural' and 'human-like' ratings, and report inter-rater or test-retest reliability if available.
- Ensure all tables and figures are referenced in the text and include error bars or confidence intervals where appropriate.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive review. We address each major comment below and have revised the manuscript to provide the requested methodological details and verifications.
read point-by-point responses
-
Referee: [Methods and Results] Methods and Results sections: the abstract asserts statistically significant differences in naturalness and other metrics, yet supplies no details on the precise rating scales, statistical tests, effect sizes, p-values, participant demographics, randomization procedure, or controls for confounds. These omissions make it impossible to evaluate whether the data support the central claim.
Authors: We agree that the original submission omitted critical statistical and procedural details. In the revised manuscript we have expanded the Methods section to specify the 7-point Likert scales for all metrics (naturalness, human-likeness, etc.), the exact statistical tests (independent-samples t-tests with Bonferroni correction for the three-group comparisons), reported effect sizes (Cohen’s d), exact p-values, participant demographics (mean age 28.4, 62% female, all Farsi native speakers with no prior SAT exposure), the block-randomization procedure, and confound controls (pre-screening for therapy experience and daily engagement logs). The Results section now includes a full statistical table. These additions directly address the concern and allow independent evaluation of the claims. revision: yes
-
Referee: [Methods] Methods section: the single-agent variant is described as using 'identical knowledge-base and the same prompts' as the multi-agent system, but no quantitative verification (token counts, exact prompt strings, or ablation removing only the FSM while holding prompts fixed) is provided. Without this, observed differences cannot be confidently attributed to orchestration rather than unintended prompt or retrieval differences.
Authors: We acknowledge the need for explicit verification. The revised Methods section now reports token counts for the shared prompts (single-agent: 1,842 tokens; multi-agent per stage: 1,837–1,851 tokens), includes the full prompt templates in a new appendix, and describes an additional ablation experiment in which the FSM was removed while every other component (knowledge base, prompts, retrieval, memory) remained identical. The ablation results show that the performance gap narrows substantially when orchestration is removed, supporting attribution to the FSM staging rather than prompt or retrieval artifacts. revision: yes
Circularity Check
No circularity: purely empirical RCT with no derivations or fitted predictions
full rationale
The paper reports results from an 8-day RCT (N=66) comparing three chatbot variants on self-reported metrics. No equations, parameter fitting, model derivations, or 'predictions' appear in the provided text or abstract. The central claim (multi-agent superiority in naturalness) rests directly on trial data rather than reducing to any input by construction. Any self-citations are incidental and non-load-bearing; the study is self-contained against external benchmarks with no self-definitional loops or ansatz smuggling.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Self-reported user perceptions in an 8-day trial accurately reflect the quality of therapeutic dialogue.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/Breath1024.leanperiod8, period1024 definitions and 8-tick oscillator echoes?
echoesECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
eight-day randomized controlled trial (RCT) with N=66... following the eight phases of the SAT protocol
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.leanLogicNat orbit and initial Peano algebra unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
multi-agent system utilizing finite state machine aligned with therapeutic stages and a shared long-term memory... single-agent using identical knowledge-base and the same prompts
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1]
- [2]
-
[3]
Lisa Alazraki. 2021. A deep-learning assisted empathetic guide for selfattachment therapy.Lisa_Alazraki_report. pdf(2021)
work page 2021
-
[4]
Lisa Alazraki, Ali Ghachem, Neophytos Polydorou, Foaad Khosmood, and Abbas Edalat. 2021. An Empathetic AI Coach for Self-Attachment Therapy. In2021 IEEE Third International Conference on Cognitive Machine Intelligence (CogMI). 78–87. doi:10.1109/CogMI52975.2021.00019
-
[5]
2008.Loss-Sadness and Depression: Attachment and Loss Volume 3
EJM Bowlby. 2008.Loss-Sadness and Depression: Attachment and Loss Volume 3. Vol. 3. Random House, New York, NY, US
work page 2008
-
[6]
2010.Separation: Anxiety and anger: Attachment and loss Volume 2
Edward John Mostyn Bowlby. 2010.Separation: Anxiety and anger: Attachment and loss Volume 2. Vol. 2. Random House, New York, NY, US
work page 2010
-
[7]
Ryuhaerang Choi, Taehan Kim, Subin Park, Jennifer G Kim, and Sung-Ju Lee. 2025. Private Yet Social: How LLM Chatbots Support and Challenge Eating Disorder Recovery. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems. 1–19
work page 2025
-
[8]
Abbas Edalat. 2015. Introduction to self-attachment and its neural basis. In2015 international joint conference on neural networks (IJCNN). IEEE, 1–8
work page 2015
-
[9]
Abbas Edalat. 2016. Self-Attachment: A holistic approach to computational psychiatry.Computational Neurology and Psychiatry Springer Series on Bio-/Neuro-informatics6 (2016), 273–314. doi:10.1007/978-3-319-49959-8_10
-
[10]
Abbas Edalat, Ruoyu Hu, Zeena Patel, Neophytos Polydorou, Frank Ryan, and Dasha Nicholls. 2025. Self-initiated humour protocol: a pilot study with an AI agent.Frontiers in Digital Health7 (2025), 1530131. 8 Elahimanesh et al
work page 2025
- [11]
-
[12]
Cathy Mengying Fang, Auren R Liu, Valdemar Danry, Eunhae Lee, Samantha WT Chan, Pat Pataranutaporn, Pattie Maes, Jason Phang, Michael Lampe, Lama Ahmad, et al. 2025. How ai and human behaviors shape psychosocial effects of chatbot use: A longitudinal randomized controlled study.arXiv preprint arXiv:2503.17473(2025)
work page internal anchor Pith review arXiv 2025
-
[13]
Kathleen Kara Fitzpatrick, Alison Darcy, and Molly Vierhile. 2017. Delivering cognitive behavior therapy to young adults with symptoms of depression and anxiety using a fully automated conversational agent (Woebot): a randomized controlled trial.JMIR mental health4, 2 (2017), e7785
work page 2017
-
[14]
Kathleen Kara Fitzpatrick, Alison Darcy, and Molly Vierhile. 2017. Delivering Cognitive Behavior Therapy to Young Adults With Symptoms of Depression and Anxiety Using a Fully Automated Conversational Agent (Woebot): A Randomized Controlled Trial.JMIR Ment Health4, 2 (06 Jun 2017), e19. doi:10.2196/mental.7785
-
[15]
Yang Gao, Yangbin Dai, Guangtao Zhang, Honglei Guo, Fariba Mostajeran, Binge Zheng, and Tao Yu. 2025. Trust in Virtual Agents: Exploring the Role of Stylization and Voice.IEEE Transactions on Visualization and Computer Graphics31, 5 (2025), 3623–3633. doi:10.1109/TVCG.2025.3549566
-
[16]
Asma Ghandeharioun, Daniel McDuff, Mary Czerwinski, and Kael Rowan. 2019. Emma: An emotion-aware wellbeing chatbot. In2019 8th international conference on affective computing and intelligent interaction (ACII). IEEE, 1–7
work page 2019
-
[17]
Robert L Hatcher and J Arthur Gillaspy. 2006. Development and validation of a revised short version of the Working Alliance Inventory.Psychotherapy Research16, 1 (2006), 12–25. doi:10.1080/10503300500352500
-
[18]
Yuhao He, Li Yang, Chunlian Qian, Tong Li, Zhengyuan Su, Qiang Zhang, and Xiangqing Hou. 2023. Conversational Agent Interventions for Mental Health Problems: Systematic Review and Meta-analysis of Randomized Controlled Trials.J Med Internet Res25 (28 Apr 2023), e43862. doi:10.2196/43862
-
[19]
Yuhao He, Li Yang, Chunlian Qian, Tong Li, Zhengyuan Su, Qiang Zhang, and Xiangqing Hou. 2023. Conversational agent interventions for mental health problems: systematic review and meta-analysis of randomized controlled trials.Journal of Medical Internet Research25 (2023), e43862
work page 2023
- [20]
-
[21]
Ahmad Ishqi Jabir, Laura Martinengo, Xiaowen Lin, John Torous, Mythily Subramaniam, and Lorainne Tudor Car. 2023. Evaluating Conversational Agents for Mental Health: Scoping Review of Outcomes and Outcome Measurement Instruments.J Med Internet Res25 (19 Apr 2023), e44548. doi:10.2196/44548
-
[22]
Boyoung Kang and Munpyo Hong. 2025. Development and Evaluation of a Mental Health Chatbot Using ChatGPT 4.0: Mixed Methods User Experience Study With Korean Users.JMIR Med Inform13 (3 Jan 2025), e63538. doi:10.2196/63538
-
[23]
Taewan Kim, Seolyeong Bae, Hyun Ah Kim, Su-woo Lee, Hwajung Hong, Chanmo Yang, and Young-Ho Kim. 2024. MindfulDiary: Harnessing large language model to support psychiatric patients’ journaling. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1–20
work page 2024
-
[24]
Rafal Kocielnik, Saleema Amershi, and Paul N Bennett. 2019. Will you accept an imperfect ai? exploring designs for adjusting end-user expectations of ai systems. InProceedings of the 2019 CHI conference on human factors in computing systems. 1–14
work page 2019
-
[25]
Alicia Jiayun Law, Ruoyu Hu, Lisa Alazraki, Anandha Gopalan, Neophytos Polydorou, and Abbas Edalat. 2022. A Multilingual Virtual Guide for Self-Attachment Technique. In2022 IEEE 4th International Conference on Cognitive Machine Intelligence (CogMI). IEEE, 107–116
work page 2022
-
[26]
Yi-Chieh Lee, Naomi Yamashita, and Yun Huang. 2020. Designing a Chatbot as a Mediator for Promoting Deep Self-Disclosure to a Real Mental Health Professional.Proc. ACM Hum.-Comput. Interact.4, CSCW1, Article 31 (May 2020), 27 pages. doi:10.1145/3392836
-
[27]
Kien Hoa Ly, Ann-Marie Ly, and Gerhard Andersson. 2017. Fully automated conversational agent for promoting mental well-being: a pilot RCT. Internet Interventions10 (2017), 39–46
work page 2017
- [28]
-
[29]
OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[30]
Falguni Patel, Riya Thakore, Ishita Nandwani, and Santosh Kumar Bharti. 2019. Combating depression in students using an intelligent chatbot: a cognitive behavioral therapy. In2019 IEEE 16th India council international conference (INDICON). IEEE, 1–4
work page 2019
- [31]
-
[32]
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2024. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv:2305.18290 [cs.LG] https://arxiv.org/abs/2305.18290
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[33]
Cristina Reguera-Gómez, Denis Paperno, and Maaike H. T. de Boer. 2025. Empathy vs Neutrality: Designing and Evaluating a Natural Chatbot for the Healthcare Domain. InProceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025), Richard Johansson and Sara S...
work page 2025
-
[34]
Niclas Rosteck, Julian Striegl, and Claudia Loitsch. 2025. Bridging the Treatment Gap: A Novel LLM-Driven System for Scalable Initial Patient Assessments in Mental Healthcare. InProceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems. 1–8
work page 2025
-
[35]
Woosuk Seo, Chanmo Yang, and Young-Ho Kim. 2024. Chacha: leveraging large language models to prompt children to share their emotions about personal events. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1–20
work page 2024
-
[36]
Ashish Sharma, Kevin Rushton, Inna Wanyin Lin, Theresa Nguyen, and Tim Althoff. 2024. Facilitating self-guided mental health interventions through human-language model interaction: A case study of cognitive restructuring. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1–29
work page 2024
-
[37]
Kunmi Sobowale, Daniel Kevin Humphrey, and Sophia Yingruo Zhao. 2025. Evaluating Generative AI Psychotherapy Chatbots Used by Youth: Cross-Sectional Study.JMIR Mental Health12 (2025), e79838
work page 2025
-
[38]
Inhwa Song, Sachin R Pendse, Neha Kumar, and Munmun De Choudhury. 2025. The typing cure: Experiences with large language model chatbots for mental health support.Proceedings of the ACM on Human-Computer Interaction9, 7 (2025), 1–29
work page 2025
-
[39]
Lars St, Svante Wold, et al. 1989. Analysis of variance (ANOVA).Chemometrics and intelligent laboratory systems6, 4 (1989), 259–272
work page 1989
-
[40]
Xin Sun, Isabelle Teljeur, Zhuying Li, and Jos A. Bosch. 2024. Can a Funny Chatbot Make a Difference? Infusing Humor into Conversational Agent for Behavioral Intervention. InProceedings of the 6th ACM Conference on Conversational User Interfaces(Luxembourg, Luxembourg)(CUI ’24). Association for Computing Machinery, New York, NY, USA, Article 3, 19 pages. ...
-
[41]
Annalisa Szymanski, Noah Ziems, Heather A Eicher-Miller, Toby Jia-Jun Li, Meng Jiang, and Ronald A Metoyer. 2025. Limitations of the llm-as-a- judge approach for evaluating llm outputs in expert knowledge tasks. InProceedings of the 30th International Conference on Intelligent User Interfaces. 952–966
work page 2025
-
[42]
Alan C Y Tong, Kent T Y Wong, Wing W T Chung, and Winnie W S Mak. 2025. Effectiveness of Topic-Based Chatbots on Mental Health Self-Care and Mental Well-Being: Randomized Controlled Trial.J Med Internet Res27 (30 Apr 2025), e70436. doi:10.2196/70436
- [43]
- [44]
- [45]
-
[46]
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al
-
[47]
Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems36 (2023), 46595–46623
work page 2023
-
[48]
Rui Zheng, Shihan Dou, Songyang Gao, Yuan Hua, Wei Shen, Binghai Wang, Yan Liu, Senjie Jin, Qin Liu, Yuhao Zhou, Limao Xiong, Lu Chen, Zhiheng Xi, Nuo Xu, Wenbin Lai, Minghao Zhu, Cheng Chang, Zhangyue Yin, Rongxiang Weng, Wensen Cheng, Haoran Huang, Tianxiang Sun, Hang Yan, Tao Gui, Qi Zhang, Xipeng Qiu, and Xuanjing Huang. 2023. Secrets of RLHF in Large...
work page internal anchor Pith review arXiv 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.