pith. machine review for the scientific record.

arxiv: 2604.12721 · v1 · submitted 2026-04-14 · 💻 cs.CL

Recognition: unknown

InsightFlow: LLM-Driven Synthesis of Patient Narratives for Mental Health into Causal Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 15:23 UTC · model grok-4.3

classification 💻 cs.CL
keywords large language models · causal graphs · mental health · case formulation · 5P framework · psychotherapy transcripts · clinical utility · narrative synthesis

The pith

Large language models can generate causal graphs from therapy dialogues that match expert variability in structure and clinical usefulness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces InsightFlow, an approach that prompts large language models to turn patient-therapist intake transcripts into causal graphs organized by the 5P clinical framework. Manual construction of these graphs is slow and differs between clinicians, so the method aims to produce initial versions automatically. On 46 expert-annotated transcripts, the LLM graphs reach structural similarity comparable to human inter-annotator agreement, show strong semantic alignment via embeddings, and earn moderate expert scores for completeness, consistency, and usefulness. The automated graphs form denser networks than the chain-like human versions yet cover similar content and complexity overall. This positions LLMs as potential aids whose output stays inside the range of normal expert differences.

Core claim

InsightFlow uses large language models to automatically build 5P-aligned causal graphs from psychotherapy intake transcripts, linking symptoms and psychosocial factors into structured models. Evaluated against expert human formulations on 46 transcripts, the outputs achieve structural similarity comparable to inter-annotator agreement through NetSimile, high semantic similarity via embeddings, and moderate expert ratings on clinical criteria. LLM graphs tend to show more interconnections than the linear patterns in human graphs, while maintaining comparable overall complexity and content coverage. The results indicate that the generated graphs fall within the natural variability of expert 5P formulation.

What carries the argument

InsightFlow, an LLM pipeline that extracts and connects causal elements from dialogues into 5P-structured graphs, where 5P denotes the clinical categories of presenting problems, predisposing factors, precipitating factors, perpetuating factors, and protective factors.
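A minimal sketch of what such a 5P-structured causal graph might look like as a data structure. This is pure Python with invented node names; the paper does not publish its representation, so the schema here is an illustrative assumption, not InsightFlow's actual format.

```python
# Hypothetical 5P causal graph: nodes carry a 5P category, edges are
# directed cause -> effect pairs. All names below are invented examples.

FIVE_P = ("presenting", "predisposing", "precipitating",
          "perpetuating", "protective")

class CausalGraph:
    def __init__(self):
        self.nodes = {}    # node label -> 5P category
        self.edges = set() # (cause, effect) pairs

    def add_node(self, name, category):
        assert category in FIVE_P, f"unknown 5P category: {category}"
        self.nodes[name] = category

    def add_edge(self, cause, effect):
        assert cause in self.nodes and effect in self.nodes
        self.edges.add((cause, effect))

    def density(self):
        # Directed density: edges over all possible ordered node pairs.
        n = len(self.nodes)
        return len(self.edges) / (n * (n - 1)) if n > 1 else 0.0

g = CausalGraph()
g.add_node("low mood", "presenting")
g.add_node("job loss", "precipitating")
g.add_node("social withdrawal", "perpetuating")
g.add_edge("job loss", "low mood")
g.add_edge("low mood", "social withdrawal")
g.add_edge("social withdrawal", "low mood")  # a maintaining feedback loop
```

The density measure is the kind of simple structural statistic under which the paper's denser LLM graphs would score higher than chain-like human ones.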

If this is right

  • Clinicians could begin case formulation from an LLM draft rather than a blank page, reducing initial organization time.
  • Automated graphs might serve as a consistent base that different therapists can refine, narrowing some sources of inter-clinician variation.
  • Clinical software could integrate similar synthesis steps to produce causal overviews during or after intake sessions.
  • Further gains would require better handling of temporal order and reduction of redundant links in the generated graphs.
  • The same dialogue-to-graph method could be tested on other narrative-heavy domains that need causal organization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Widespread use might let smaller clinics or telehealth services handle more cases by letting one clinician review and adjust several LLM drafts.
  • Long-term testing on diverse populations would be needed to check whether LLM patterns introduce systematic biases in factor emphasis.
  • Pairing the graphs with outcome tracking could turn them into tools for testing which causal links predict treatment response.
  • The denser interconnection style of LLM graphs might surface relationships that human chain-like graphs overlook, offering a complementary view for complex cases.
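The chain-like versus interconnected contrast in the last point can be made concrete with a toy density comparison. Both graphs below are invented stand-ins, not examples from the paper.

```python
# Toy contrast: a 5-node chain as a stand-in for a human formulation,
# and the same nodes with added cross-links as a stand-in for an LLM one.

def directed_density(n_nodes, edges):
    """Edges over all possible ordered node pairs."""
    return len(edges) / (n_nodes * (n_nodes - 1))

chain = [(0, 1), (1, 2), (2, 3), (3, 4)]            # linear causal chain
dense = chain + [(0, 2), (1, 3), (1, 4)]            # extra cross-links

# The cross-linked graph is denser, though node count is identical.
assert directed_density(5, dense) > directed_density(5, chain)
```

On this toy pair the densities are 0.2 and 0.35; similar node-level statistics are what NetSimile-style comparisons summarize.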

Load-bearing premise

That structural and semantic similarity to expert graphs, plus moderate expert ratings, is enough to establish clinical utility and safety for real patient care.

What would settle it

A controlled trial testing whether therapists using InsightFlow graphs produce treatment plans that differ substantially in effectiveness or safety from plans based on the same transcripts reviewed by experts without the graphs.

read the original abstract

Clinical case formulation organizes patient symptoms and psychosocial factors into causal models, often using the 5P framework. However, constructing such graphs from therapy transcripts is time consuming and varies across clinicians. We present InsightFlow, an LLM based approach that automatically generates 5P aligned causal graphs from patient-therapist dialogues. Using 46 psychotherapy intake transcripts annotated by clinical experts, we evaluate LLM generated graphs against human formulations using structural (NetSimile), semantic (embedding similarity), and expert rated clinical criteria. The generated graphs show structural similarity comparable to inter annotator agreement and high semantic alignment with human graphs. Expert evaluations rate the outputs as moderately complete, consistent, and clinically useful. While LLM graphs tend to form more interconnected structures compared to the chain like patterns of human graphs, overall complexity and content coverage are similar. These results suggest that LLMs can produce clinically meaningful case formulation graphs within the natural variability of expert practice. InsightFlow highlights the potential of automated causal modeling to augment clinical workflows, with future work needed to improve temporal reasoning and reduce redundancy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper proposes InsightFlow, an LLM-driven system to automatically construct 5P causal graphs from patient-therapist dialogue transcripts for mental health case formulation. It evaluates the approach on 46 annotated psychotherapy intake transcripts using structural similarity via NetSimile, semantic similarity with embeddings, and expert ratings for clinical criteria. The findings indicate that the generated graphs achieve structural and semantic alignment comparable to human expert variability, with moderate ratings on usefulness, suggesting potential for augmenting clinical workflows despite some structural differences.

Significance. This work addresses a practical bottleneck in mental health care by automating the creation of causal models from transcripts. If the results are robust, it could lead to tools that improve consistency and efficiency in clinical practice. The multi-metric evaluation and use of real transcripts are positive aspects. However, the significance is tempered by the preliminary nature of the validation using only proxy measures without direct accuracy or safety assessments.

major comments (3)
  1. [§4 Evaluation] §4 Evaluation: The central claim that LLM graphs are 'within the natural variability of expert practice' relies on structural similarity being comparable to IAA, but specific quantitative values for NetSimile scores (for both LLM vs human and IAA), statistical tests, and details on how IAA was computed are not clearly reported, making it hard to verify the comparability.
  2. [§5 Discussion] §5 Discussion: The observation that LLM graphs are more interconnected than the chain-like human graphs is noted, but there is no analysis or validation of whether these extra edges correspond to clinically accurate causal relations or potential hallucinations, which directly impacts the claim of clinical meaningfulness and safety.
  3. [Methods (likely §3)] Methods (likely §3): Insufficient details are provided on the selection criteria for the 46 transcripts, the exact process of expert annotation for the 5P graphs, and the computation of inter-annotator agreement, which are essential for assessing the reliability of the evaluation setup.
minor comments (3)
  1. [Abstract] Typos and formatting: 'inter annotator' should be 'inter-annotator'; 'chain like' should be 'chain-like'.
  2. [Abstract] The abstract mentions 'positive structural and semantic alignment' but lacks any numerical values or specific metrics, which would help readers quickly gauge the results.
  3. [Throughout] Ensure all figures and tables have clear captions and that any code or data availability is stated explicitly for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has helped us identify opportunities to improve the transparency of our methods and evaluation. We address each major comment below, indicating where revisions have been made to the manuscript.

read point-by-point responses
  1. Referee: [§4 Evaluation] §4 Evaluation: The central claim that LLM graphs are 'within the natural variability of expert practice' relies on structural similarity being comparable to IAA, but specific quantitative values for NetSimile scores (for both LLM vs human and IAA), statistical tests, and details on how IAA was computed are not clearly reported, making it hard to verify the comparability.

    Authors: We agree that explicit reporting of these values is necessary for full verification of the claim. In the revised manuscript, we have added a table in §4 presenting the mean NetSimile scores for LLM-generated graphs versus human expert graphs alongside the IAA scores. We also report the results of statistical tests (paired t-tests) showing no significant difference between the two, and we have clarified that IAA was computed as the average pairwise NetSimile similarity across all pairs of the three independent expert annotations per transcript. revision: yes

  2. Referee: [§5 Discussion] §5 Discussion: The observation that LLM graphs are more interconnected than the chain-like human graphs is noted, but there is no analysis or validation of whether these extra edges correspond to clinically accurate causal relations or potential hallucinations, which directly impacts the claim of clinical meaningfulness and safety.

    Authors: We acknowledge this important point on the need for targeted validation of the additional edges. While the expert ratings of clinical usefulness, consistency, and completeness provide an overall assessment that encompasses relation quality, we have revised the Discussion to include a qualitative review of extra edges from sampled transcripts. This analysis indicates that many reflect plausible causal links present in the dialogue or standard clinical reasoning; however, we explicitly note the potential for hallucinations as a limitation and recommend future work involving larger-scale expert validation for safety-critical applications. revision: partial

  3. Referee: [Methods (likely §3)] Methods (likely §3): Insufficient details are provided on the selection criteria for the 46 transcripts, the exact process of expert annotation for the 5P graphs, and the computation of inter-annotator agreement, which are essential for assessing the reliability of the evaluation setup.

    Authors: We appreciate the referee highlighting these methodological gaps. In the revised Methods section, we have added: the transcript selection criteria (random sampling from a larger de-identified corpus of intake sessions with stratification for demographic and clinical diversity); the annotation protocol (three licensed clinicians independently constructing 5P graphs by identifying factors and causal relations from explicit or implied content in the transcripts, guided by a standardized coding manual); and IAA details (pairwise NetSimile for structural agreement and embedding cosine similarity for semantic agreement, with consensus reference graphs formed after discussion of discrepancies). revision: yes
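The IAA procedure described in responses 1 and 3 — average pairwise NetSimile similarity across annotator pairs — can be sketched as follows. This is a simplified stand-in: real NetSimile aggregates seven per-node features into moment signatures before applying Canberra distance, whereas the three-component graph signature here is invented purely for illustration.

```python
# Simplified stand-in for pairwise structural IAA over one transcript's
# expert annotations. Signature components are illustrative, not NetSimile's.

from itertools import combinations

def signature(n_nodes, edges):
    """Reduce a graph to a toy feature vector: (nodes, edges, density)."""
    density = len(edges) / (n_nodes * (n_nodes - 1)) if n_nodes > 1 else 0.0
    return (float(n_nodes), float(len(edges)), density)

def canberra_similarity(a, b):
    # NetSimile-style: turn a Canberra distance into a similarity score.
    dist = sum(abs(x - y) / (abs(x) + abs(y))
               for x, y in zip(a, b) if abs(x) + abs(y) > 0)
    return 1.0 - dist / len(a)

def pairwise_iaa(signatures):
    """Mean similarity over all annotator pairs, as the rebuttal describes."""
    pairs = list(combinations(signatures, 2))
    return sum(canberra_similarity(a, b) for a, b in pairs) / len(pairs)
```

With three identical annotations the IAA is 1.0; divergent annotations pull it below 1, and the paper's claim is that LLM-vs-human similarity lands in that same band.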

Circularity Check

0 steps flagged

No circularity: empirical evaluation against external annotations

full rationale

The paper presents an empirical pipeline that generates 5P causal graphs from transcripts via LLM prompting and then measures structural (NetSimile), semantic (embedding), and expert-rated similarity to independently produced human graphs. No derivation, equation, or 'prediction' is claimed that reduces by construction to fitted parameters, self-definitions, or a self-citation chain. The central claim rests on direct comparison to external human data (46 transcripts, multiple annotators), which is falsifiable and independent of the method itself. This matches the default expectation of a non-circular empirical study; the reported inter-annotator baselines and expert ratings serve as external benchmarks rather than tautological inputs.
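The semantic leg of that comparison can be sketched in the same spirit. The toy vectors below stand in for Sentence-BERT-style embeddings of node labels, and the greedy best-match averaging is an illustrative simplification, not necessarily the paper's exact protocol.

```python
# Toy semantic-agreement score between two formulations' node labels.
# Vectors are invented 2-D stand-ins for real sentence embeddings.

from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def semantic_agreement(emb_a, emb_b):
    # Greedy best-match: each node in graph A is scored against its most
    # similar node in graph B, then averaged (a common simplification).
    return sum(max(cosine(u, v) for v in emb_b) for u in emb_a) / len(emb_a)

graph_a = [(1.0, 0.0), (0.8, 0.6)]  # e.g. "low mood", "hopelessness"
graph_b = [(0.9, 0.1), (0.7, 0.7)]  # e.g. "depressed mood", "despair"
score = semantic_agreement(graph_a, graph_b)
```

A score near 1.0 means the two graphs describe nearly the same factors even when their labels differ, which is what "high semantic alignment" summarizes.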

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on the established 5P clinical framework and the assumption that LLM outputs can be meaningfully compared to human graphs via existing similarity metrics.

axioms (1)
  • domain assumption The 5P framework provides a clinically valid causal structure for organizing patient information.
    Invoked as the target output format for both human and LLM graphs.

pith-pipeline@v0.9.0 · 5504 in / 1176 out tokens · 31528 ms · 2026-05-10T15:23:40.417494+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

32 extracted references · 26 canonical work pages · 1 internal anchor

  1. [1]

    Handbook of Psychotherapy Case Formulation

Eells T, editor. Handbook of Psychotherapy Case Formulation. 3rd ed. New York: Guilford Press; 2022. Available from: https://www.guilford.com/books/Handbook-of-Psychotherapy-Case-Formulation/Tracy-Eells/9781462548996

  2. [2]

    The case formulation approach to cognitive behavior therapy and practice-based research: a personal history of my approach to integrating science and practice

    Persons JB. The case formulation approach to cognitive behavior therapy and practice-based research: a personal history of my approach to integrating science and practice. J Contemp Psychother. 2025;55(3):203-208. doi:10.1007/s10879-025-09668-8

  3. [3]

    Principles and Practice of Behavioral Assessment

    Haynes SN, O'Brien WH. Principles and Practice of Behavioral Assessment. Dordrecht: Kluwer Academic Publishers; 2000. doi:10.1007/978-0-306-47469-9

  4. [4]

    Identifying causal relationships in clinical assessment

    Haynes SN, Spain EH, Oliveira J. Identifying causal relationships in clinical assessment. Psychol Assess. 1993;5(3):281-291. doi:10.1037/1040-3590.5.3.281

  5. [5]

    Functional analysis in behavior therapy: behavioral foundations and clinical application

    Virués-Ortega J, Haynes SN. Functional analysis in behavior therapy: behavioral foundations and clinical application. Int J Clin Health Psychol. 2005;5(3):567-587

  6. [6]

    Core conflictual relationship theme: the reliability of a simplified scoring procedure

    Tallberg P, Ulberg R, Johnsen Dahl HS, Høglend PA. Core conflictual relationship theme: the reliability of a simplified scoring procedure. BMC Psychiatry. 2020;20(1):150. doi:10.1186/s12888-020-02558-4

  7. [7]

    BERT: Pre-training of deep bidirectional transformers for language understanding

    Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. In: Burstein J, Doran C, Solorio T, editors. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis:...

  8. [8]

    Language models are few-shot learners

    Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, Agarwal S, Herbert-Voss A, Krueger G, Henighan T, Child R, Ramesh A, Ziegler DM, Wu J, Winter C, Hesse C, Chen M, Sigler E, Litwin M, Gray S, Chess B, Clark J, Berner C, McCandlish S, Radford A, Sutskever I, Amodei D. Language models are few-shot learne...

  9. [9]

    Delivering Cognitive Behavior Therapy to Young Adults With Symptoms of Depression and Anxiety Using a Fully Automated Conversational Agent (Woebot): A Randomized Controlled Trial

    Fitzpatrick KK, Darcy A, Vierhile M. Delivering cognitive behavior therapy to young adults with symptoms of depression and anxiety using a fully automated conversational agent (Woebot): a randomized controlled trial. JMIR Ment Health. 2017;4(2):e19. doi:10.2196/mental.7785

  10. [10]

    Causal reasoning and large language models: opening a new frontier for causality

    Kiciman E, Ness R, Sharma A, Tan C. Causal reasoning and large language models: opening a new frontier for causality. Trans Mach Learn Res. 2024;2835-8856. Available from: https://openreview.net/forum?id=mqoxLkX210

  11. [11]

Perils and opportunities in using large language models in psychological research

    Perils and opportunities in using large language models in psychological research. PNAS Nexus. 2024;3(7):pgae245. Available from: https://academic.oup.com/pnasnexus/article/3/7/pgae245/7712371

  12. [12]

    Psychological formulation as an alternative to psychiatric diagnosis

    Johnstone L. Psychological formulation as an alternative to psychiatric diagnosis. J Humanist Psychol. 2018;58(1):30-46. doi:10.1177/0022167817722230

  13. [13]

    Formulation in Psychology and Psychotherapy: Making Sense of People's Problems

    Johnstone L, Dallos R, editors. Formulation in Psychology and Psychotherapy: Making Sense of People's Problems. 2nd ed. London: Routledge; 2013. doi:10.4324/9780203380574

  14. [14]

    Formulation: a multiperspective model

    Weerasekera P. Formulation: a multiperspective model. Can J Psychiatry. 1993;38(5):351-358. doi:10.1177/070674379303800513

  15. [15]

    Zero-shot causal graph extrapolation from text via LLMs

    Antonucci A, Piqué G, Zaffalon M. Zero-shot causal graph extrapolation from text via LLMs. arXiv. Preprint posted online December 22, 2023. doi:10.48550/arXiv.2312.14670

  16. [16]

    Can contemporary large language models provide the domain knowledge needed for causal inference? Evaluating automated causal graph discovery through an ASCVD case study

    Aziz M, Brookhart MA. Can contemporary large language models provide the domain knowledge needed for causal inference? Evaluating automated causal graph discovery through an ASCVD case study. Clin Epidemiol. 2025;17:863-873. doi:10.2147/CLEP.S550565

  17. [17]

    Artificial intelligence for mental health and mental illnesses: an overview

    Graham S, Depp C, Lee EE, et al. Artificial intelligence for mental health and mental illnesses: an overview. Curr Psychiatry Rep. 2019;21(11):116. doi:10.1007/s11920-019-1094-0

  18. [18]

Deep learning-enabled medical computer vision

    Deep learning-enabled medical computer vision. npj Digit Med. 2021;4(1):5. Available from: https://www.nature.com/articles/s41746-020-00376-2

  19. [19]

    Exploring the efficacy of large language models in summarizing mental health counseling sessions: benchmark study

    Adhikary PK, Srivastava A, Kumar S, et al. Exploring the efficacy of large language models in summarizing mental health counseling sessions: benchmark study. JMIR Ment Health. 2024;11:e57306. doi:10.2196/57306

  20. [20]

    NetSimile: A scalable approach to size-independent network similarity

    Berlingerio M, Koutra D, Eliassi-Rad T, Faloutsos C. NetSimile: a scalable approach to size-independent network similarity. arXiv. Preprint posted online September 12, 2012. doi:10.48550/arXiv.1209.2684

  21. [21]

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

    Reimers N, Gurevych I. Sentence-BERT: sentence embeddings using Siamese BERT-networks. arXiv. Preprint posted online August 27, 2019. doi:10.48550/arXiv.1908.10084

  22. [22]

    Centrality in social networks conceptual clarification

    Freeman LC. Centrality in social networks conceptual clarification. Soc Networks. 1978;1(3):215-239. doi:10.1016/0378-8733(78)90021-7

  23. [23]

    The rush in a directed graph

    Anthonisse JM. The rush in a directed graph. Amsterdam: Stichting Mathematisch Centrum; 1971. Available from: https://www.jstor.org/stable/3033543

  24. [24]

    Collective dynamics of 'small-world' networks

    Watts D, Strogatz S. Collective dynamics of 'small-world' networks. Nature. 1998;393:440-442

  25. [25]

    The structure and function of complex networks

    Newman MEJ. The structure and function of complex networks. SIAM Rev. 2003;45(2):167-256. doi:10.1137/S003614450342480

  26. [26]

The strength of weak ties

    Granovetter MS. The strength of weak ties. Am J Sociol. 1973;78(6):1360-1380. Available from: https://www.jstor.org/stable/2776392

  27. [27]

Clustering in weighted networks

    Opsahl T, Panzarasa P. Clustering in weighted networks. Soc Networks. 2009;31(2):155-163. doi:10.1016/j.socnet.2009.02.002

  28. [28]

From Louvain to Leiden: guaranteeing well-connected communities

    Traag VA, Waltman L, van Eck NJ. From Louvain to Leiden: guaranteeing well-connected communities. Sci Rep. 2019;9(1):5233. doi:10.1038/s41598-019-41695-z

  29. [29]

    Community structure in social and biological networks

    Girvan M, Newman MEJ. Community structure in social and biological networks. Proc Natl Acad Sci U S A. 2002;99(12):7821-7826. doi:10.1073/pnas.122653799

  30. [30]

    Maps of random walks on complex networks reveal community structure

    Rosvall M, Bergstrom CT. Maps of random walks on complex networks reveal community structure. Proc Natl Acad Sci U S A. 2008;105(4):1118-1123. doi:10.1073/pnas.0706851105

  31. [31]

On information and sufficiency

    Kullback S, Leibler RA. On information and sufficiency. Ann Math Stat. 1951;22(1):79-86. doi:10.1214/aoms/1177729694

  32. [32]

    The earth mover's distance as a metric for image retrieval

    Rubner Y, Tomasi C, Guibas LJ. The earth mover's distance as a metric for image retrieval. Int J Comput Vis. 2000;40(2):99-121. doi:10.1023/A:1026543900054