InsightFlow: LLM-Driven Synthesis of Patient Narratives for Mental Health into Causal Models
Pith reviewed 2026-05-10 15:23 UTC · model grok-4.3
The pith
Large language models can generate causal graphs from therapy dialogues that match expert variability in structure and clinical usefulness.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
InsightFlow uses large language models to automatically build 5P-aligned causal graphs from psychotherapy intake transcripts, linking symptoms and psychosocial factors into structured models. Evaluated against expert human formulations on 46 transcripts, the outputs achieve structural similarity (measured with NetSimile) comparable to inter-annotator agreement, high semantic similarity via embeddings, and moderate expert ratings on clinical criteria. LLM graphs tend to show more interconnections than the chain-like patterns in human graphs, while maintaining comparable overall complexity and content coverage. The results indicate that the generated graphs fall within the natural variability of expert 5P formulations.
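For orientation, NetSimile summarizes each graph with per-node structural features, aggregates each feature into moment statistics to form a size-independent "signature", and compares signatures with Canberra distance. Below is a simplified sketch using only two of NetSimile's seven node features (degree and local clustering coefficient); the real method uses more features and more aggregation moments:

```python
import statistics

def node_features(adj):
    """Per-node features: (degree, local clustering coefficient).

    `adj` maps each node to the set of its neighbours (undirected).
    """
    feats = []
    for v, nbrs in adj.items():
        deg = len(nbrs)
        # Count edges among this node's neighbours.
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        clust = 0.0 if deg < 2 else 2 * links / (deg * (deg - 1))
        feats.append((deg, clust))
    return feats

def signature(adj):
    """Aggregate each feature column into (mean, median, population stdev)."""
    sig = []
    for col in zip(*node_features(adj)):
        sig += [statistics.mean(col), statistics.median(col),
                statistics.pstdev(col)]
    return sig

def canberra(a, b):
    """Canberra distance between two signatures; 0 means identical."""
    return sum(abs(x - y) / (abs(x) + abs(y))
               for x, y in zip(a, b) if abs(x) + abs(y) > 0)
```

A triangle graph and a three-node chain, for instance, yield a nonzero Canberra distance between their signatures, while any graph is at distance 0 from itself.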
What carries the argument
InsightFlow, an LLM pipeline that extracts and connects causal elements from dialogues into 5P-structured graphs, where 5P denotes the clinical categories of presenting problems, predisposing factors, precipitating factors, perpetuating factors, and protective factors.
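As an illustration of what such a graph object might look like (a hypothetical sketch, not InsightFlow's actual data model; all labels are invented), nodes can carry one of the five categories while directed edges encode causal influence:

```python
# Hypothetical sketch of a 5P-aligned causal graph; not the paper's
# implementation. Nodes are tagged with one of the five clinical
# categories; a directed edge (x, y) means "x causally influences y".
CATEGORIES = {"presenting", "predisposing", "precipitating",
              "perpetuating", "protective"}

class CausalGraph:
    def __init__(self):
        self.nodes = {}    # label -> 5P category
        self.edges = set() # (source label, target label)

    def add_node(self, label, category):
        if category not in CATEGORIES:
            raise ValueError(f"unknown 5P category: {category}")
        self.nodes[label] = category

    def add_edge(self, src, dst):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added first")
        self.edges.add((src, dst))

# Invented formulation fragment, including a maintaining cycle.
g = CausalGraph()
g.add_node("low mood", "presenting")
g.add_node("job loss", "precipitating")
g.add_node("social withdrawal", "perpetuating")
g.add_edge("job loss", "low mood")
g.add_edge("low mood", "social withdrawal")
g.add_edge("social withdrawal", "low mood")
```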
If this is right
- Clinicians could begin case formulation from an LLM draft rather than a blank page, reducing initial organization time.
- Automated graphs might serve as a consistent base that different therapists can refine, narrowing some sources of inter-clinician variation.
- Clinical software could integrate similar synthesis steps to produce causal overviews during or after intake sessions.
- Further gains would require better handling of temporal order and reduction of redundant links in the generated graphs.
- The same dialogue-to-graph method could be tested on other narrative-heavy domains that need causal organization.
Where Pith is reading between the lines
- Widespread use might let smaller clinics or telehealth services handle more cases by letting one clinician review and adjust several LLM drafts.
- Long-term testing on diverse populations would be needed to check whether LLM patterns introduce systematic biases in factor emphasis.
- Pairing the graphs with outcome tracking could turn them into tools for testing which causal links predict treatment response.
- The denser interconnection style of LLM graphs might surface relationships that human chain-like graphs overlook, offering a complementary view for complex cases.
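The "denser interconnection" contrast is straightforward to quantify. A toy sketch comparing the edge density of a chain-like graph against one with extra cross-links (invented graphs, purely illustrative):

```python
# Edge density of a directed graph: edges / possible edges n*(n-1).
def density(n_nodes, edges):
    return len(edges) / (n_nodes * (n_nodes - 1))

n = 4
chain = [(0, 1), (1, 2), (2, 3)]          # chain-like (human-style) graph
dense = chain + [(0, 2), (0, 3), (1, 3)]  # extra cross-links (LLM-style)

print(density(n, chain))  # 0.25
print(density(n, dense))  # 0.5
```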
Load-bearing premise
That structural and semantic similarity to expert graphs, together with moderate expert ratings, is enough to establish clinical utility and safety for real patient care.
What would settle it
A controlled trial in which therapists using InsightFlow graphs produce treatment plans that differ substantially in effectiveness or safety from plans based on the same transcripts reviewed by experts without the graphs.
Original abstract
Clinical case formulation organizes patient symptoms and psychosocial factors into causal models, often using the 5P framework. However, constructing such graphs from therapy transcripts is time consuming and varies across clinicians. We present InsightFlow, an LLM based approach that automatically generates 5P aligned causal graphs from patient-therapist dialogues. Using 46 psychotherapy intake transcripts annotated by clinical experts, we evaluate LLM generated graphs against human formulations using structural (NetSimile), semantic (embedding similarity), and expert rated clinical criteria. The generated graphs show structural similarity comparable to inter annotator agreement and high semantic alignment with human graphs. Expert evaluations rate the outputs as moderately complete, consistent, and clinically useful. While LLM graphs tend to form more interconnected structures compared to the chain like patterns of human graphs, overall complexity and content coverage are similar. These results suggest that LLMs can produce clinically meaningful case formulation graphs within the natural variability of expert practice. InsightFlow highlights the potential of automated causal modeling to augment clinical workflows, with future work needed to improve temporal reasoning and reduce redundancy.
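The semantic comparison the abstract mentions reduces, at its core, to cosine similarity between embedding vectors. A minimal sketch with plain Python lists standing in for sentence embeddings (the embedder is an assumption here; real pipelines would use something like Sentence-BERT):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for embeddings of two node labels.
print(cosine([0.9, 0.1, 0.0], [0.8, 0.2, 0.1]))  # close to 1.0
```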
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes InsightFlow, an LLM-driven system to automatically construct 5P causal graphs from patient-therapist dialogue transcripts for mental health case formulation. It evaluates the approach on 46 annotated psychotherapy intake transcripts using structural similarity via NetSimile, semantic similarity with embeddings, and expert ratings for clinical criteria. The findings indicate that the generated graphs achieve structural and semantic alignment comparable to human expert variability, with moderate ratings on usefulness, suggesting potential for augmenting clinical workflows despite some structural differences.
Significance. This work addresses a practical bottleneck in mental health care by automating the creation of causal models from transcripts. If the results are robust, it could lead to tools that improve consistency and efficiency in clinical practice. The multi-metric evaluation and use of real transcripts are positive aspects. However, the significance is tempered by the preliminary nature of the validation using only proxy measures without direct accuracy or safety assessments.
Major comments (3)
- [§4 Evaluation] The central claim that LLM graphs are 'within the natural variability of expert practice' relies on structural similarity being comparable to IAA, but specific quantitative values for NetSimile scores (for both LLM vs human and IAA), statistical tests, and details on how IAA was computed are not clearly reported, making it hard to verify the comparability.
- [§5 Discussion] The observation that LLM graphs are more interconnected than the chain-like human graphs is noted, but there is no analysis or validation of whether these extra edges correspond to clinically accurate causal relations or potential hallucinations, which directly impacts the claim of clinical meaningfulness and safety.
- [Methods (likely §3)] Insufficient details are provided on the selection criteria for the 46 transcripts, the exact process of expert annotation for the 5P graphs, and the computation of inter-annotator agreement, which are essential for assessing the reliability of the evaluation setup.
Minor comments (3)
- [Abstract] Typos and formatting: 'inter annotator' should be 'inter-annotator'; 'chain like' should be 'chain-like'.
- [Abstract] The abstract reports structural and semantic alignment but gives no numerical values or specific scores, which would help readers quickly gauge the results.
- [Throughout] Ensure all figures and tables have clear captions and that any code or data availability is stated explicitly for reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which has helped us identify opportunities to improve the transparency of our methods and evaluation. We address each major comment below, indicating where revisions have been made to the manuscript.
Point-by-point responses
- Referee: [§4 Evaluation] The central claim that LLM graphs are 'within the natural variability of expert practice' relies on structural similarity being comparable to IAA, but specific quantitative values for NetSimile scores (for both LLM vs human and IAA), statistical tests, and details on how IAA was computed are not clearly reported, making it hard to verify the comparability.
Authors: We agree that explicit reporting of these values is necessary for full verification of the claim. In the revised manuscript, we have added a table in §4 presenting the mean NetSimile scores for LLM-generated graphs versus human expert graphs alongside the IAA scores. We also report the results of statistical tests (paired t-tests) showing no significant difference between the two, and we have clarified that IAA was computed as the average pairwise NetSimile similarity across all pairs of the three independent expert annotations per transcript. revision: yes
- Referee: [§5 Discussion] The observation that LLM graphs are more interconnected than the chain-like human graphs is noted, but there is no analysis or validation of whether these extra edges correspond to clinically accurate causal relations or potential hallucinations, which directly impacts the claim of clinical meaningfulness and safety.
Authors: We acknowledge this important point on the need for targeted validation of the additional edges. While the expert ratings of clinical usefulness, consistency, and completeness provide an overall assessment that encompasses relation quality, we have revised the Discussion to include a qualitative review of extra edges from sampled transcripts. This analysis indicates that many reflect plausible causal links present in the dialogue or standard clinical reasoning; however, we explicitly note the potential for hallucinations as a limitation and recommend future work involving larger-scale expert validation for safety-critical applications. revision: partial
- Referee: [Methods (likely §3)] Insufficient details are provided on the selection criteria for the 46 transcripts, the exact process of expert annotation for the 5P graphs, and the computation of inter-annotator agreement, which are essential for assessing the reliability of the evaluation setup.
Authors: We appreciate the referee highlighting these methodological gaps. In the revised Methods section, we have added: the transcript selection criteria (random sampling from a larger de-identified corpus of intake sessions with stratification for demographic and clinical diversity); the annotation protocol (three licensed clinicians independently constructing 5P graphs by identifying factors and causal relations from explicit or implied content in the transcripts, guided by a standardized coding manual); and IAA details (pairwise NetSimile for structural agreement and embedding cosine similarity for semantic agreement, with consensus reference graphs formed after discussion of discrepancies). revision: yes
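The IAA procedure described above (average pairwise similarity across the three annotators' graphs per transcript) amounts to a small helper. A sketch using Jaccard edge overlap as a stand-in for the NetSimile and embedding similarities named in the rebuttal:

```python
from itertools import combinations

def mean_pairwise(items, similarity):
    """Average similarity over all unordered pairs of annotations."""
    pairs = list(combinations(items, 2))
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)

# Toy example: three annotators' graphs as edge sets, compared with
# Jaccard overlap (a simple proxy for the metrics used in the paper).
def jaccard(a, b):
    return len(a & b) / len(a | b)

annots = [{("A", "B"), ("B", "C")},
          {("A", "B"), ("A", "C")},
          {("A", "B"), ("B", "C"), ("A", "C")}]
iaa = mean_pairwise(annots, jaccard)  # mean of 1/3, 2/3, 2/3
```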
Circularity Check
No circularity: empirical evaluation against external annotations
Full rationale
The paper presents an empirical pipeline that generates 5P causal graphs from transcripts via LLM prompting and then measures structural (NetSimile), semantic (embedding), and expert-rated similarity to independently produced human graphs. No derivation, equation, or 'prediction' is claimed that reduces by construction to fitted parameters, self-definitions, or a self-citation chain. The central claim rests on direct comparison to external human data (46 transcripts, multiple annotators), which is falsifiable and independent of the method itself. This matches the default expectation of a non-circular empirical study; the reported inter-annotator baselines and expert ratings serve as external benchmarks rather than tautological inputs.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: The 5P framework provides a clinically valid causal structure for organizing patient information.
Reference graph
Works this paper leans on
-
[1]
Handbook of Psychotherapy Case Formulation
Eells T, editor. Handbook of Psychotherapy Case Formulation. 3rd ed. New York: Guilford Press; 2022. Available from: https://www.guilford.com/books/Handbook-of-Psychotherapy-Case-Formulation/Tracy-Eells/9781462548996
-
[2]
Persons JB. The case formulation approach to cognitive behavior therapy and practice-based research: a personal history of my approach to integrating science and practice. J Contemp Psychother. 2025;55(3):203-208. doi:10.1007/s10879-025-09668-8
-
[3]
Principles and Practice of Behavioral Assessment
Haynes SN, O'Brien WH. Principles and Practice of Behavioral Assessment. Dordrecht: Kluwer Academic Publishers; 2000. doi:10.1007/978-0-306-47469-9
-
[4]
Identifying causal relationships in clinical assessment
Haynes SN, Spain EH, Oliveira J. Identifying causal relationships in clinical assessment. Psychol Assess. 1993;5(3):281-291. doi:10.1037/1040-3590.5.3.281
-
[5]
Functional analysis in behavior therapy: behavioral foundations and clinical application
Virués-Ortega J, Haynes SN. Functional analysis in behavior therapy: behavioral foundations and clinical application. Int J Clin Health Psychol. 2005;5(3):567-587
-
[6]
Core conflictual relationship theme: the reliability of a simplified scoring procedure
Tallberg P, Ulberg R, Johnsen Dahl HS, Høglend PA. Core conflictual relationship theme: the reliability of a simplified scoring procedure. BMC Psychiatry. 2020;20(1):150. doi:10.1186/s12888-020-02558-4
-
[7]
BERT: Pre-training of deep bidirectional transformers for language understanding
Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. In: Burstein J, Doran C, Solorio T, editors. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis:...
-
[8]
Language models are few-shot learners
Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, Agarwal S, Herbert-Voss A, Krueger G, Henighan T, Child R, Ramesh A, Ziegler DM, Wu J, Winter C, Hesse C, Chen M, Sigler E, Litwin M, Gray S, Chess B, Clark J, Berner C, McCandlish S, Radford A, Sutskever I, Amodei D. Language models are few-shot learne...
-
[9]
Fitzpatrick KK, Darcy A, Vierhile M. Delivering cognitive behavior therapy to young adults with symptoms of depression and anxiety using a fully automated conversational agent (Woebot): a randomized controlled trial. JMIR Ment Health. 2017;4(2):e19. doi:10.2196/mental.7785
-
[10]
Causal reasoning and large language models: opening a new frontier for causality
Kiciman E, Ness R, Sharma A, Tan C. Causal reasoning and large language models: opening a new frontier for causality. Trans Mach Learn Res. 2024;2835-8856. Available from: https://openreview.net/forum?id=mqoxLkX210
-
[11]
Perils and opportunities in using large language models in psychological research
Perils and opportunities in using large language models in psychological research. PNAS Nexus. 2024;3(7):pgae245. Available from: https://academic.oup.com/pnasnexus/article/3/7/pgae245/7712371
-
[12]
Psychological formulation as an alternative to psychiatric diagnosis
Johnstone L. Psychological formulation as an alternative to psychiatric diagnosis. J Humanist Psychol. 2018;58(1):30-46. doi:10.1177/0022167817722230
-
[13]
Formulation in Psychology and Psychotherapy: Making Sense of People's Problems
Johnstone L, Dallos R, editors. Formulation in Psychology and Psychotherapy: Making Sense of People's Problems. 2nd ed. London: Routledge; 2013. doi:10.4324/9780203380574
-
[14]
Formulation: a multiperspective model
Weerasekera P. Formulation: a multiperspective model. Can J Psychiatry. 1993;38(5):351-358. doi:10.1177/070674379303800513
-
[15]
Zero-shot causal graph extrapolation from text via LLMs
Antonucci A, Piqué G, Zaffalon M. Zero-shot causal graph extrapolation from text via LLMs. arXiv. Preprint posted online December 22, 2023. doi:10.48550/arXiv.2312.14670
-
[16]
Aziz M, Brookhart MA. Can contemporary large language models provide the domain knowledge needed for causal inference? Evaluating automated causal graph discovery through an ASCVD case study. Clin Epidemiol. 2025;17:863-873. doi:10.2147/CLEP.S550565
-
[17]
Artificial intelligence for mental health and mental illnesses: an overview
Graham S, Depp C, Lee EE, et al. Artificial intelligence for mental health and mental illnesses: an overview. Curr Psychiatry Rep. 2019;21(11):116. doi:10.1007/s11920-019-1094-0
-
[18]
Deep learning-enabled medical computer vision
Deep learning-enabled medical computer vision. npj Digit Med. 2021;4(1):5. Available from: https://www.nature.com/articles/s41746-020-00376-2
-
[19]
Adhikary PK, Srivastava A, Kumar S, et al. Exploring the efficacy of large language models in summarizing mental health counseling sessions: benchmark study. JMIR Ment Health. 2024;11:e57306. doi:10.2196/57306
-
[20]
NetSimile: A scalable approach to size-independent network similarity
Berlingerio M, Koutra D, Eliassi-Rad T, Faloutsos C. NetSimile: a scalable approach to size-independent network similarity. arXiv. Preprint posted online September 12, 2012. doi:10.48550/arXiv.1209.2684
-
[21]
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Reimers N, Gurevych I. Sentence-BERT: sentence embeddings using Siamese BERT-networks. arXiv. Preprint posted online August 27, 2019. doi:10.48550/arXiv.1908.10084
-
[22]
Centrality in social networks conceptual clarification
Freeman LC. Centrality in social networks conceptual clarification. Soc Networks. 1978;1(3):215-239. doi:10.1016/0378-8733(78)90021-7
-
[23]
Anthonisse JM. The rush in a directed graph. Amsterdam: Stichting Mathematisch Centrum; 1971. Available from: https://www.jstor.org/stable/3033543
-
[24]
Collective dynamics of 'small-world' networks
Watts D, Strogatz S. Collective dynamics of 'small-world' networks. Nature. 1998;393:440-442
-
[25]
The structure and function of complex networks
Newman MEJ. The structure and function of complex networks. SIAM Rev. 2003;45(2):167-256. doi:10.1137/S003614450342480
-
[26]
Granovetter MS. The strength of weak ties. Am J Sociol. 1973;78(6):1360-1380. Available from: https://www.jstor.org/stable/2776392
-
[27]
Clustering in weighted networks
Opsahl T, Panzarasa P. Clustering in weighted networks. Soc Networks. 2009;31(2):155-163. doi:10.1016/j.socnet.2009.02.002
-
[28]
From Louvain to Leiden: guaranteeing well-connected communities
Traag VA, Waltman L, van Eck NJ. From Louvain to Leiden: guaranteeing well-connected communities. Sci Rep. 2019;9(1):5233. doi:10.1038/s41598-019-41695-z
-
[29]
Community structure in social and biological networks
Girvan M, Newman MEJ. Community structure in social and biological networks. Proc Natl Acad Sci U S A. 2002;99(12):7821-7826. doi:10.1073/pnas.122653799
-
[30]
Maps of random walks on complex networks reveal community structure
Rosvall M, Bergstrom CT. Maps of random walks on complex networks reveal community structure. Proc Natl Acad Sci U S A. 2008;105(4):1118-1123. doi:10.1073/pnas.0706851105
-
[31]
Kullback S, Leibler RA. On information and sufficiency. Ann Math Stat. 1951;22(1):79-86. doi:10.1214/aoms/1177729694
-
[32]
The earth mover's distance as a metric for image retrieval
Rubner Y, Tomasi C, Guibas LJ. The earth mover's distance as a metric for image retrieval. Int J Comput Vis. 2000;40(2):99-121. doi:10.1023/A:1026543900054