pith. machine review for the scientific record.

arxiv: 2604.12721 · v1 · submitted 2026-04-14 · 💻 cs.CL

Recognition: unknown

InsightFlow: LLM-Driven Synthesis of Patient Narratives for Mental Health into Causal Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 15:23 UTC · model grok-4.3

classification 💻 cs.CL
keywords large language models · causal graphs · mental health · case formulation · 5P framework · psychotherapy transcripts · clinical utility · narrative synthesis

The pith

Large language models can generate causal graphs from therapy dialogues that match expert variability in structure and clinical usefulness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces InsightFlow, an approach that prompts large language models to turn patient-therapist intake transcripts into causal graphs organized by the 5P clinical framework. Manual construction of these graphs is slow and differs between clinicians, so the method aims to produce initial versions automatically. On 46 expert-annotated transcripts, the LLM graphs reach structural similarity comparable to human inter-annotator agreement, show strong semantic alignment via embeddings, and earn moderate expert scores for completeness, consistency, and usefulness. The automated graphs form denser networks than the chain-like human versions yet cover similar content and complexity overall. This positions LLMs as potential aids whose output stays inside the range of normal expert differences.

Core claim

InsightFlow uses large language models to automatically build 5P-aligned causal graphs from psychotherapy intake transcripts, linking symptoms and psychosocial factors into structured models. Evaluated against expert human formulations on 46 transcripts, the outputs achieve structural similarity comparable to inter-annotator agreement through NetSimile, high semantic similarity via embeddings, and moderate expert ratings on clinical criteria. LLM graphs tend to show more interconnections than the linear patterns in human graphs, while maintaining comparable overall complexity and content coverage. The results indicate that the generated graphs fall within the natural variability of expert 5P formulation.

What carries the argument

InsightFlow, an LLM pipeline that extracts and connects causal elements from dialogues into 5P-structured graphs, where 5P denotes the clinical categories of presenting problems, predisposing factors, precipitating factors, perpetuating factors, and protective factors.
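A minimal sketch of what such a 5P-structured causal graph might look like as a data structure. This is pure Python with invented node names; the paper does not publish its representation, so the schema here is an illustrative assumption, not InsightFlow's actual format.

```python
# Hypothetical 5P causal graph: nodes carry a 5P category, edges are
# directed cause -> effect pairs. All names below are invented examples.

FIVE_P = ("presenting", "predisposing", "precipitating",
          "perpetuating", "protective")

class CausalGraph:
    def __init__(self):
        self.nodes = {}    # node label -> 5P category
        self.edges = set() # (cause, effect) pairs

    def add_node(self, name, category):
        assert category in FIVE_P, f"unknown 5P category: {category}"
        self.nodes[name] = category

    def add_edge(self, cause, effect):
        assert cause in self.nodes and effect in self.nodes
        self.edges.add((cause, effect))

    def density(self):
        # Directed density: edges over all possible ordered node pairs.
        n = len(self.nodes)
        return len(self.edges) / (n * (n - 1)) if n > 1 else 0.0

g = CausalGraph()
g.add_node("low mood", "presenting")
g.add_node("job loss", "precipitating")
g.add_node("social withdrawal", "perpetuating")
g.add_edge("job loss", "low mood")
g.add_edge("low mood", "social withdrawal")
g.add_edge("social withdrawal", "low mood")  # a maintaining feedback loop
```

The density measure is the kind of simple structural statistic under which the paper's denser LLM graphs would score higher than chain-like human ones.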

If this is right

  • Clinicians could begin case formulation from an LLM draft rather than a blank page, reducing initial organization time.
  • Automated graphs might serve as a consistent base that different therapists can refine, narrowing some sources of inter-clinician variation.
  • Clinical software could integrate similar synthesis steps to produce causal overviews during or after intake sessions.
  • Further gains would require better handling of temporal order and reduction of redundant links in the generated graphs.
  • The same dialogue-to-graph method could be tested on other narrative-heavy domains that need causal organization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Widespread use might let smaller clinics or telehealth services handle more cases by letting one clinician review and adjust several LLM drafts.
  • Long-term testing on diverse populations would be needed to check whether LLM patterns introduce systematic biases in factor emphasis.
  • Pairing the graphs with outcome tracking could turn them into tools for testing which causal links predict treatment response.
  • The denser interconnection style of LLM graphs might surface relationships that human chain-like graphs overlook, offering a complementary view for complex cases.
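The chain-like versus interconnected contrast in the last point can be made concrete with a toy density comparison. Both graphs below are invented stand-ins, not examples from the paper.

```python
# Toy contrast: a 5-node chain as a stand-in for a human formulation,
# and the same nodes with added cross-links as a stand-in for an LLM one.

def directed_density(n_nodes, edges):
    """Edges over all possible ordered node pairs."""
    return len(edges) / (n_nodes * (n_nodes - 1))

chain = [(0, 1), (1, 2), (2, 3), (3, 4)]            # linear causal chain
dense = chain + [(0, 2), (1, 3), (1, 4)]            # extra cross-links

# The cross-linked graph is denser, though node count is identical.
assert directed_density(5, dense) > directed_density(5, chain)
```

On this toy pair the densities are 0.2 and 0.35; similar node-level statistics are what NetSimile-style comparisons summarize.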

Load-bearing premise

That structural and semantic similarity to expert graphs, plus moderate expert ratings, is enough to establish clinical utility and safety for real patient care.

What would settle it

A controlled trial testing whether therapists using InsightFlow graphs produce treatment plans that differ substantially in effectiveness or safety from plans based on the same transcripts reviewed by experts without the graphs.

read the original abstract

Clinical case formulation organizes patient symptoms and psychosocial factors into causal models, often using the 5P framework. However, constructing such graphs from therapy transcripts is time consuming and varies across clinicians. We present InsightFlow, an LLM based approach that automatically generates 5P aligned causal graphs from patient-therapist dialogues. Using 46 psychotherapy intake transcripts annotated by clinical experts, we evaluate LLM generated graphs against human formulations using structural (NetSimile), semantic (embedding similarity), and expert rated clinical criteria. The generated graphs show structural similarity comparable to inter annotator agreement and high semantic alignment with human graphs. Expert evaluations rate the outputs as moderately complete, consistent, and clinically useful. While LLM graphs tend to form more interconnected structures compared to the chain like patterns of human graphs, overall complexity and content coverage are similar. These results suggest that LLMs can produce clinically meaningful case formulation graphs within the natural variability of expert practice. InsightFlow highlights the potential of automated causal modeling to augment clinical workflows, with future work needed to improve temporal reasoning and reduce redundancy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper proposes InsightFlow, an LLM-driven system to automatically construct 5P causal graphs from patient-therapist dialogue transcripts for mental health case formulation. It evaluates the approach on 46 annotated psychotherapy intake transcripts using structural similarity via NetSimile, semantic similarity with embeddings, and expert ratings for clinical criteria. The findings indicate that the generated graphs achieve structural and semantic alignment comparable to human expert variability, with moderate ratings on usefulness, suggesting potential for augmenting clinical workflows despite some structural differences.

Significance. This work addresses a practical bottleneck in mental health care by automating the creation of causal models from transcripts. If the results are robust, it could lead to tools that improve consistency and efficiency in clinical practice. The multi-metric evaluation and use of real transcripts are positive aspects. However, the significance is tempered by the preliminary nature of the validation using only proxy measures without direct accuracy or safety assessments.

major comments (3)
  1. [§4 Evaluation] §4 Evaluation: The central claim that LLM graphs are 'within the natural variability of expert practice' relies on structural similarity being comparable to IAA, but specific quantitative values for NetSimile scores (for both LLM vs human and IAA), statistical tests, and details on how IAA was computed are not clearly reported, making it hard to verify the comparability.
  2. [§5 Discussion] §5 Discussion: The observation that LLM graphs are more interconnected than the chain-like human graphs is noted, but there is no analysis or validation of whether these extra edges correspond to clinically accurate causal relations or potential hallucinations, which directly impacts the claim of clinical meaningfulness and safety.
  3. [Methods (likely §3)] Methods (likely §3): Insufficient details are provided on the selection criteria for the 46 transcripts, the exact process of expert annotation for the 5P graphs, and the computation of inter-annotator agreement, which are essential for assessing the reliability of the evaluation setup.
minor comments (3)
  1. [Abstract] Typos and formatting: 'inter annotator' should be 'inter-annotator'; 'chain like' should be 'chain-like'.
  2. [Abstract] The abstract mentions 'positive structural and semantic alignment' but lacks any numerical values or specific metrics, which would help readers quickly gauge the results.
  3. [Throughout] Ensure all figures and tables have clear captions and that any code or data availability is stated explicitly for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has helped us identify opportunities to improve the transparency of our methods and evaluation. We address each major comment below, indicating where revisions have been made to the manuscript.

read point-by-point responses
  1. Referee: [§4 Evaluation] §4 Evaluation: The central claim that LLM graphs are 'within the natural variability of expert practice' relies on structural similarity being comparable to IAA, but specific quantitative values for NetSimile scores (for both LLM vs human and IAA), statistical tests, and details on how IAA was computed are not clearly reported, making it hard to verify the comparability.

    Authors: We agree that explicit reporting of these values is necessary for full verification of the claim. In the revised manuscript, we have added a table in §4 presenting the mean NetSimile scores for LLM-generated graphs versus human expert graphs alongside the IAA scores. We also report the results of statistical tests (paired t-tests) showing no significant difference between the two, and we have clarified that IAA was computed as the average pairwise NetSimile similarity across all pairs of the three independent expert annotations per transcript. revision: yes

  2. Referee: [§5 Discussion] §5 Discussion: The observation that LLM graphs are more interconnected than the chain-like human graphs is noted, but there is no analysis or validation of whether these extra edges correspond to clinically accurate causal relations or potential hallucinations, which directly impacts the claim of clinical meaningfulness and safety.

    Authors: We acknowledge this important point on the need for targeted validation of the additional edges. While the expert ratings of clinical usefulness, consistency, and completeness provide an overall assessment that encompasses relation quality, we have revised the Discussion to include a qualitative review of extra edges from sampled transcripts. This analysis indicates that many reflect plausible causal links present in the dialogue or standard clinical reasoning; however, we explicitly note the potential for hallucinations as a limitation and recommend future work involving larger-scale expert validation for safety-critical applications. revision: partial

  3. Referee: [Methods (likely §3)] Methods (likely §3): Insufficient details are provided on the selection criteria for the 46 transcripts, the exact process of expert annotation for the 5P graphs, and the computation of inter-annotator agreement, which are essential for assessing the reliability of the evaluation setup.

    Authors: We appreciate the referee highlighting these methodological gaps. In the revised Methods section, we have added: the transcript selection criteria (random sampling from a larger de-identified corpus of intake sessions with stratification for demographic and clinical diversity); the annotation protocol (three licensed clinicians independently constructing 5P graphs by identifying factors and causal relations from explicit or implied content in the transcripts, guided by a standardized coding manual); and IAA details (pairwise NetSimile for structural agreement and embedding cosine similarity for semantic agreement, with consensus reference graphs formed after discussion of discrepancies). revision: yes
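The IAA procedure described in responses 1 and 3 — average pairwise NetSimile similarity across annotator pairs — can be sketched as follows. This is a simplified stand-in: real NetSimile aggregates seven per-node features into moment signatures before applying Canberra distance, whereas the three-component graph signature here is invented purely for illustration.

```python
# Simplified stand-in for pairwise structural IAA over one transcript's
# expert annotations. Signature components are illustrative, not NetSimile's.

from itertools import combinations

def signature(n_nodes, edges):
    """Reduce a graph to a toy feature vector: (nodes, edges, density)."""
    density = len(edges) / (n_nodes * (n_nodes - 1)) if n_nodes > 1 else 0.0
    return (float(n_nodes), float(len(edges)), density)

def canberra_similarity(a, b):
    # NetSimile-style: turn a Canberra distance into a similarity score.
    dist = sum(abs(x - y) / (abs(x) + abs(y))
               for x, y in zip(a, b) if abs(x) + abs(y) > 0)
    return 1.0 - dist / len(a)

def pairwise_iaa(signatures):
    """Mean similarity over all annotator pairs, as the rebuttal describes."""
    pairs = list(combinations(signatures, 2))
    return sum(canberra_similarity(a, b) for a, b in pairs) / len(pairs)
```

With three identical annotations the IAA is 1.0; divergent annotations pull it below 1, and the paper's claim is that LLM-vs-human similarity lands in that same band.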

Circularity Check

0 steps flagged

No circularity: empirical evaluation against external annotations

full rationale

The paper presents an empirical pipeline that generates 5P causal graphs from transcripts via LLM prompting and then measures structural (NetSimile), semantic (embedding), and expert-rated similarity to independently produced human graphs. No derivation, equation, or 'prediction' is claimed that reduces by construction to fitted parameters, self-definitions, or a self-citation chain. The central claim rests on direct comparison to external human data (46 transcripts, multiple annotators), which is falsifiable and independent of the method itself. This matches the default expectation of a non-circular empirical study; the reported inter-annotator baselines and expert ratings serve as external benchmarks rather than tautological inputs.
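The semantic leg of that comparison can be sketched in the same spirit. The toy vectors below stand in for Sentence-BERT-style embeddings of node labels, and the greedy best-match averaging is an illustrative simplification, not necessarily the paper's exact protocol.

```python
# Toy semantic-agreement score between two formulations' node labels.
# Vectors are invented 2-D stand-ins for real sentence embeddings.

from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def semantic_agreement(emb_a, emb_b):
    # Greedy best-match: each node in graph A is scored against its most
    # similar node in graph B, then averaged (a common simplification).
    return sum(max(cosine(u, v) for v in emb_b) for u in emb_a) / len(emb_a)

graph_a = [(1.0, 0.0), (0.8, 0.6)]  # e.g. "low mood", "hopelessness"
graph_b = [(0.9, 0.1), (0.7, 0.7)]  # e.g. "depressed mood", "despair"
score = semantic_agreement(graph_a, graph_b)
```

A score near 1.0 means the two graphs describe nearly the same factors even when their labels differ, which is what "high semantic alignment" summarizes.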

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on the established 5P clinical framework and the assumption that LLM outputs can be meaningfully compared to human graphs via existing similarity metrics.

axioms (1)
  • domain assumption The 5P framework provides a clinically valid causal structure for organizing patient information.
    Invoked as the target output format for both human and LLM graphs.

pith-pipeline@v0.9.0 · 5504 in / 1176 out tokens · 31528 ms · 2026-05-10T15:23:40.417494+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

32 extracted references · 26 canonical work pages · 1 internal anchor

  1. [1]

    Handbook of Psychotherapy Case Formulation

Eells T, editor. Handbook of Psychotherapy Case Formulation. 3rd ed. New York: Guilford Press; 2022. Available from: https://www.guilford.com/books/Handbook-of-Psychotherapy-Case-Formulation/Tracy-Eells/9781462548996

  2. [2]

    The case formulation approach to cognitive behavior therapy and practice-based research: a personal history of my approach to integrating science and practice

    Persons JB. The case formulation approach to cognitive behavior therapy and practice-based research: a personal history of my approach to integrating science and practice. J Contemp Psychother. 2025;55(3):203-208. doi:10.1007/s10879-025-09668-8

  3. [3]

    Principles and Practice of Behavioral Assessment

    Haynes SN, O'Brien WH. Principles and Practice of Behavioral Assessment. Dordrecht: Kluwer Academic Publishers; 2000. doi:10.1007/978-0-306-47469-9

  4. [4]

    Identifying causal relationships in clinical assessment

    Haynes SN, Spain EH, Oliveira J. Identifying causal relationships in clinical assessment. Psychol Assess. 1993;5(3):281-291. doi:10.1037/1040-3590.5.3.281

  5. [5]

    Functional analysis in behavior therapy: behavioral foundations and clinical application

    Virués-Ortega J, Haynes SN. Functional analysis in behavior therapy: behavioral foundations and clinical application. Int J Clin Health Psychol. 2005;5(3):567-587

  6. [6]

    Core conflictual relationship theme: the reliability of a simplified scoring procedure

    Tallberg P, Ulberg R, Johnsen Dahl HS, Høglend PA. Core conflictual relationship theme: the reliability of a simplified scoring procedure. BMC Psychiatry. 2020;20(1):150. doi:10.1186/s12888-020-02558-4

  7. [7]

    BERT: Pre-training of deep bidirectional transformers for language understanding

    Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. In: Burstein J, Doran C, Solorio T, editors. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis:...

  8. [8]

    Language models are few-shot learners

    Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, Agarwal S, Herbert-Voss A, Krueger G, Henighan T, Child R, Ramesh A, Ziegler DM, Wu J, Winter C, Hesse C, Chen M, Sigler E, Litwin M, Gray S, Chess B, Clark J, Berner C, McCandlish S, Radford A, Sutskever I, Amodei D. Language models are few-shot learne...

  9. [9]

    Delivering Cognitive Behavior Therapy to Young Adults With Symptoms of Depression and Anxiety Using a Fully Automated Conversational Agent (Woebot): A Randomized Controlled Trial

    Fitzpatrick KK, Darcy A, Vierhile M. Delivering cognitive behavior therapy to young adults with symptoms of depression and anxiety using a fully automated conversational agent (Woebot): a randomized controlled trial. JMIR Ment Health. 2017;4(2):e19. doi:10.2196/mental.7785

  10. [10]

    Causal reasoning and large language models: opening a new frontier for causality

    Kiciman E, Ness R, Sharma A, Tan C. Causal reasoning and large language models: opening a new frontier for causality. Trans Mach Learn Res. 2024;2835-8856. Available from: https://openreview.net/forum?id=mqoxLkX210

  11. [11]

Perils and opportunities in using large language models in psychological research

    Perils and opportunities in using large language models in psychological research. PNAS Nexus. 2024;3(7):pgae245. Available from: https://academic.oup.com/pnasnexus/article/3/7/pgae245/7712371

  12. [12]

    Psychological formulation as an alternative to psychiatric diagnosis

    Johnstone L. Psychological formulation as an alternative to psychiatric diagnosis. J Humanist Psychol. 2018;58(1):30-46. doi:10.1177/0022167817722230

  13. [13]

    Formulation in Psychology and Psychotherapy: Making Sense of People's Problems

    Johnstone L, Dallos R, editors. Formulation in Psychology and Psychotherapy: Making Sense of People's Problems. 2nd ed. London: Routledge; 2013. doi:10.4324/9780203380574

  14. [14]

    Formulation: a multiperspective model

    Weerasekera P. Formulation: a multiperspective model. Can J Psychiatry. 1993;38(5):351-358. doi:10.1177/070674379303800513

  15. [15]

    Zero-shot causal graph extrapolation from text via LLMs

    Antonucci A, Piqué G, Zaffalon M. Zero-shot causal graph extrapolation from text via LLMs. arXiv. Preprint posted online December 22, 2023. doi:10.48550/arXiv.2312.14670

  16. [16]

    Can contemporary large language models provide the domain knowledge needed for causal inference? Evaluating automated causal graph discovery through an ASCVD case study

    Aziz M, Brookhart MA. Can contemporary large language models provide the domain knowledge needed for causal inference? Evaluating automated causal graph discovery through an ASCVD case study. Clin Epidemiol. 2025;17:863-873. doi:10.2147/CLEP.S550565

  17. [17]

    Artificial intelligence for mental health and mental illnesses: an overview

    Graham S, Depp C, Lee EE, et al. Artificial intelligence for mental health and mental illnesses: an overview. Curr Psychiatry Rep. 2019;21(11):116. doi:10.1007/s11920-019-1094-0

  18. [18]

Deep learning-enabled medical computer vision

    Deep learning-enabled medical computer vision. npj Digit Med. 2021;4(1):5. Available from: https://www.nature.com/articles/s41746-020-00376-2

  19. [19]

    Exploring the efficacy of large language models in summarizing mental health counseling sessions: benchmark study

    Adhikary PK, Srivastava A, Kumar S, et al. Exploring the efficacy of large language models in summarizing mental health counseling sessions: benchmark study. JMIR Ment Health. 2024;11:e57306. doi:10.2196/57306

  20. [20]

    NetSimile: A scalable approach to size-independent network similarity

    Berlingerio M, Koutra D, Eliassi-Rad T, Faloutsos C. NetSimile: a scalable approach to size-independent network similarity. arXiv. Preprint posted online September 12, 2012. doi:10.48550/arXiv.1209.2684

  21. [21]

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

    Reimers N, Gurevych I. Sentence-BERT: sentence embeddings using Siamese BERT-networks. arXiv. Preprint posted online August 27, 2019. doi:10.48550/arXiv.1908.10084

  22. [22]

    Centrality in social networks conceptual clarification

    Freeman LC. Centrality in social networks conceptual clarification. Soc Networks. 1978;1(3):215-239. doi:10.1016/0378-8733(78)90021-7

  23. [23]

    The rush in a directed graph

    Anthonisse JM. The rush in a directed graph. Amsterdam: Stichting Mathematisch Centrum; 1971. Available from: https://www.jstor.org/stable/3033543

  24. [24]

    Collective dynamics of 'small-world' networks

    Watts D, Strogatz S. Collective dynamics of 'small-world' networks. Nature. 1998;393:440-442

  25. [25]

    The structure and function of complex networks

    Newman MEJ. The structure and function of complex networks. SIAM Rev. 2003;45(2):167-256. doi:10.1137/S003614450342480

  26. [26]

The strength of weak ties

    Granovetter MS. The strength of weak ties. Am J Sociol. 1973;78(6):1360-1380. Available from: https://www.jstor.org/stable/2776392

  27. [27]

Clustering in weighted networks

    Opsahl T, Panzarasa P. Clustering in weighted networks. Soc Networks. 2009;31(2):155-163. doi:10.1016/j.socnet.2009.02.002

  28. [28]

From Louvain to Leiden: guaranteeing well-connected communities

    Traag VA, Waltman L, van Eck NJ. From Louvain to Leiden: guaranteeing well-connected communities. Sci Rep. 2019;9(1):5233. doi:10.1038/s41598-019-41695-z

  29. [29]

    Community structure in social and biological networks

    Girvan M, Newman MEJ. Community structure in social and biological networks. Proc Natl Acad Sci U S A. 2002;99(12):7821-7826. doi:10.1073/pnas.122653799

  30. [30]

    Maps of random walks on complex networks reveal community structure

    Rosvall M, Bergstrom CT. Maps of random walks on complex networks reveal community structure. Proc Natl Acad Sci U S A. 2008;105(4):1118-1123. doi:10.1073/pnas.0706851105

  31. [31]

On information and sufficiency

    Kullback S, Leibler RA. On information and sufficiency. Ann Math Stat. 1951;22(1):79-86. doi:10.1214/aoms/1177729694

  32. [32]

    The earth mover's distance as a metric for image retrieval

    Rubner Y, Tomasi C, Guibas LJ. The earth mover's distance as a metric for image retrieval. Int J Comput Vis. 2000;40(2):99-121. doi:10.1023/A:1026543900054