ChatCLIDS: Simulating Persuasive AI Dialogues to Promote Closed-Loop Insulin Adoption in Type 1 Diabetes Care
Pith reviewed 2026-05-18 19:36 UTC · model grok-4.3
The pith
Current large language models struggle to persuade patients with type 1 diabetes to adopt closed-loop insulin systems, even as they adapt strategies over time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central contribution is ChatCLIDS, a benchmark that uses a library of expert-validated virtual patients with clinically grounded profiles and realistic adoption barriers to simulate multi-turn persuasive dialogues with nurse agents. The evaluation shows that larger and more reflective LLMs adapt their strategies over time, yet all models struggle to overcome resistance, particularly under realistic social pressure scenarios.
What carries the argument
A library of virtual patients with heterogeneous profiles and realistic barriers, combined with simulated interactions using evidence-based persuasive strategies.
If this is right
- AI systems for health behavior change require further development to handle social influences effectively.
- The benchmark enables testing of new persuasive techniques in a controlled, scalable way.
- Limitations in current LLMs highlight the need for hybrid approaches in healthcare applications.
- Longitudinal simulations reveal adaptation capabilities not visible in single-turn tests.
Where Pith is reading between the lines
- Similar benchmarks could be developed for other health behaviors such as medication adherence or lifestyle changes.
- Real-world deployment of persuasive AI might benefit from incorporating data on actual patient responses to refine the virtual models.
- Integrating external knowledge sources or multi-agent systems could help overcome the identified resistance barriers.
Load-bearing premise
The simulated virtual patients with their adoption barriers closely mirror the actual diversity and decision processes of real individuals with type 1 diabetes.
What would settle it
Conducting parallel real-world studies where actual type 1 diabetes patients interact with similar persuasive dialogues and comparing their adoption decisions to those of the virtual patients in ChatCLIDS.
Original abstract
Real-world adoption of closed-loop insulin delivery systems (CLIDS) in type 1 diabetes remains low, driven not by technical failure, but by diverse behavioral, psychosocial, and social barriers. We introduce ChatCLIDS, the first benchmark to rigorously evaluate LLM-driven persuasive dialogue for health behavior change. Our framework features a library of expert-validated virtual patients, each with clinically grounded, heterogeneous profiles and realistic adoption barriers, and simulates multi-turn interactions with nurse agents equipped with a diverse set of evidence-based persuasive strategies. ChatCLIDS uniquely supports longitudinal counseling and adversarial social influence scenarios, enabling robust, multi-dimensional evaluation. Our findings reveal that while larger and more reflective LLMs adapt strategies over time, all models struggle to overcome resistance, especially under realistic social pressure. These results highlight critical limitations of current LLMs for behavior change, and offer a high-fidelity, scalable testbed for advancing trustworthy persuasive AI in healthcare and beyond.
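The interaction pattern the abstract describes (a nurse agent choosing persuasive strategies over several turns and visits, with social pressure acting between visits) can be pictured with a toy loop. This is only a minimal sketch under stated assumptions, not the authors' implementation: the strategy names, the 1 to 5 rating scale, the `VirtualPatient` class, and the rules in `respond` and `nurse_pick` are all hypothetical stand-ins for what would be LLM calls in the real benchmark.

```python
from dataclasses import dataclass, field
import random

# Hypothetical strategy names; the paper's actual strategy set is not listed on this page.
STRATEGIES = ["empathy", "evidence", "social_proof", "foot_in_the_door"]


@dataclass
class VirtualPatient:
    barrier: str                  # illustrative label, e.g. "cost"
    receptive_to: str             # the one strategy this toy patient responds to
    rating: float = 1.0           # assumed persuasion rating on a 1-5 scale
    history: list = field(default_factory=list)

    def respond(self, strategy: str) -> float:
        # Toy rule: a matching strategy raises the rating by one point (capped at 5).
        if strategy == self.receptive_to:
            self.rating = min(5.0, self.rating + 1.0)
        self.history.append((strategy, self.rating))
        return self.rating


def nurse_pick(last_strategy, last_gain, rng):
    # Toy adaptation: keep a strategy that just worked, otherwise switch to another one.
    if last_strategy is not None and last_gain > 0:
        return last_strategy
    options = [s for s in STRATEGIES if s != last_strategy]
    return rng.choice(options)


def run_visits(patient, n_visits=3, turns_per_visit=4, social_pressure=0.0, seed=0):
    rng = random.Random(seed)
    last, gain = None, 0.0
    for _ in range(n_visits):
        for _ in range(turns_per_visit):
            before = patient.rating
            last = nurse_pick(last, gain, rng)
            gain = patient.respond(last) - before
        # After each visit, an adversarial social contact pulls the rating back down.
        patient.rating = max(1.0, patient.rating - social_pressure)
    return patient.rating
```

Calling `run_visits(VirtualPatient("cost", "evidence"), social_pressure=0.5)` runs three four-turn visits. Modeling social influence as a fixed subtraction after each visit is one simple design choice among many; the benchmark's own operationalization may differ.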
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ChatCLIDS, the first benchmark for evaluating LLM-driven persuasive dialogues to promote adoption of closed-loop insulin delivery systems (CLIDS) in type 1 diabetes care. The framework includes a library of expert-validated virtual patients with clinically grounded heterogeneous profiles and realistic adoption barriers, simulates multi-turn interactions between these patients and nurse agents using diverse evidence-based persuasive strategies, and supports longitudinal counseling as well as adversarial social influence scenarios. Key findings are that larger and more reflective LLMs adapt strategies over time, yet all models struggle to overcome resistance, particularly under realistic social pressure.
Significance. If the virtual patient simulations prove faithful to real patient dynamics, this work offers a high-fidelity, scalable, and reproducible testbed for assessing LLM capabilities in health behavior change. It directly addresses a clinically important adoption gap and highlights concrete limitations of current models for persuasive tasks involving psychosocial and social barriers, which could guide development of more trustworthy AI systems in healthcare.
major comments (2)
- [§3 (Virtual Patient Library)] The description states that the library consists of 'expert-validated' and 'clinically grounded' heterogeneous profiles with realistic barriers, yet no quantitative validation is reported (e.g., statistical comparison of simulated resistance trajectories, barrier persistence under social pressure, or response distributions against published T1D cohort studies or interview data). This is load-bearing for the central claim that models fail to overcome resistance 'especially under realistic social pressure,' as the observed struggles could reflect unvalidated simulation choices rather than generalizable behavioral limits.
- [§5 (Evaluation and Results)] The reported adaptation of larger LLMs and universal struggles are presented as empirical observations, but the manuscript does not detail how the multi-dimensional metrics (e.g., strategy adaptation, resistance overcoming) were computed across longitudinal and adversarial conditions or whether sensitivity analyses were performed on key simulation parameters such as social influence strength. Without these, the robustness of the cross-model comparison is difficult to assess.
minor comments (2)
- [Abstract and §1] The abstract and introduction would benefit from an explicit statement of the number of LLMs evaluated, the precise set of persuasive strategies implemented, and the quantitative thresholds used to classify 'adaptation' versus 'struggle.'
- [Figures and Tables] Figure captions and table legends should clarify how virtual patient heterogeneity is sampled and how adversarial social influence is operationalized in the simulation loop.
Simulated Author's Rebuttal
We thank the referee for their constructive and insightful comments on our manuscript. We have carefully reviewed each major comment and provide detailed responses below, outlining how we will strengthen the paper through revisions.
Referee: [§3 (Virtual Patient Library)] The description states that the library consists of 'expert-validated' and 'clinically grounded' heterogeneous profiles with realistic barriers, yet no quantitative validation is reported (e.g., statistical comparison of simulated resistance trajectories, barrier persistence under social pressure, or response distributions against published T1D cohort studies or interview data). This is load-bearing for the central claim that models fail to overcome resistance 'especially under realistic social pressure,' as the observed struggles could reflect unvalidated simulation choices rather than generalizable behavioral limits.
Authors: We agree that quantitative validation would provide stronger support for the fidelity of the virtual patient library and the generalizability of our findings on resistance under social pressure. The profiles were developed by synthesizing evidence from published T1D cohort studies and interview data on adoption barriers, followed by iterative review and refinement with input from endocrinologists and diabetes care specialists to ensure clinical grounding and realistic heterogeneity. While this expert-driven process establishes qualitative alignment with real-world observations, we acknowledge that formal statistical comparisons (e.g., matching resistance trajectories or barrier distributions) were not reported in the initial submission, as the primary focus was on introducing the benchmark framework. In the revised manuscript, we will expand §3 to explicitly describe the expert validation methodology, including the number of experts consulted and key feedback incorporated, and add a limitations subsection discussing the absence of quantitative benchmarking against specific cohort datasets. We will also clarify that the central claims are framed as observations within this validated simulation environment rather than direct generalizations, pending further empirical validation. revision: yes
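One simple form the quantitative benchmarking discussed here could take is a distance between the barrier distribution in the virtual patient library and the distribution reported in a reference cohort. The sketch below uses total variation distance as an illustrative choice; neither the referee nor the authors name a specific statistic, and the function name and the barrier labels in the usage note are hypothetical.

```python
from collections import Counter


def total_variation(sim_labels, ref_labels):
    """Total variation distance between two empirical label distributions.

    Returns 0.0 when the distributions are identical and 1.0 when they share no labels.
    """
    n_sim, n_ref = len(sim_labels), len(ref_labels)
    if n_sim == 0 or n_ref == 0:
        raise ValueError("both samples must be non-empty")
    sim, ref = Counter(sim_labels), Counter(ref_labels)
    keys = set(sim) | set(ref)
    # Counter returns 0 for missing keys, so labels seen in only one sample are handled.
    return 0.5 * sum(abs(sim[k] / n_sim - ref[k] / n_ref) for k in keys)
```

For example, `total_variation(["cost", "cost", "burden", "trust"], ["cost", "burden", "burden", "trust"])` compares two made-up samples of primary barriers. A small distance would indicate that the library's barrier mix resembles the reference sample; it would not by itself validate the simulated dialogue dynamics.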
Referee: [§5 (Evaluation and Results)] The reported adaptation of larger LLMs and universal struggles are presented as empirical observations, but the manuscript does not detail how the multi-dimensional metrics (e.g., strategy adaptation, resistance overcoming) were computed across longitudinal and adversarial conditions or whether sensitivity analyses were performed on key simulation parameters such as social influence strength. Without these, the robustness of the cross-model comparison is difficult to assess.
Authors: We appreciate the referee's emphasis on methodological transparency to support the robustness of the cross-model comparisons. The multi-dimensional metrics, including strategy adaptation (tracked via shifts in persuasive strategy selection across dialogue turns in response to patient feedback) and resistance overcoming (assessed through binary and scaled classifications of patient responses indicating reduced barriers or adoption intent), are introduced in §5 for both longitudinal (multi-session progression) and adversarial (social influence) conditions. However, we recognize that the exact computation procedures and any sensitivity testing on parameters such as social influence strength could be elaborated for greater clarity. In the revision, we will update §5 to include detailed descriptions of metric calculations (e.g., explicit definitions and pseudocode for adaptation scores and resistance reduction rates), along with results from sensitivity analyses varying social influence strength across low, medium, and high levels. These additions will demonstrate that the key findings on LLM adaptation and persistent struggles remain consistent, thereby addressing concerns about the reliability of the evaluations. revision: yes
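The metric definitions promised in this response are not available on this page, so the following is only a hedged sketch of what such definitions might look like. It assumes that strategy adaptation is measured as the share of consecutive turns with a strategy change, that resistance reduction is a thresholded gain in persuasion rating, and that sensitivity analysis simply recomputes the rate at each social influence level. All function names, the threshold, and the level encoding are hypothetical.

```python
def adaptation_score(strategies):
    """Fraction of consecutive turn pairs in which the nurse changed strategy."""
    if len(strategies) < 2:
        return 0.0
    switches = sum(a != b for a, b in zip(strategies, strategies[1:]))
    return switches / (len(strategies) - 1)


def resistance_reduction_rate(ratings, threshold=1.0):
    """Share of dialogues whose persuasion rating rose by at least `threshold`.

    `ratings` is a list of (initial, final) rating pairs, one per dialogue.
    """
    if not ratings:
        return 0.0
    improved = sum((final - initial) >= threshold for initial, final in ratings)
    return improved / len(ratings)


def sensitivity_sweep(run_dialogues, levels):
    """Recompute the reduction rate at each social-influence level.

    `run_dialogues(strength)` is a caller-supplied function returning (initial, final) pairs.
    `levels` maps a label such as "low" to a numeric strength.
    """
    return {
        name: resistance_reduction_rate(run_dialogues(strength))
        for name, strength in levels.items()
    }
```

A sweep over `{"low": 0.0, "medium": 1.0, "high": 2.0}` would then show whether the cross-model ranking is stable as social influence strengthens. Note that this switch-rate score ignores whether a change was prompted by patient feedback, which the authors' description of adaptation does take into account.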
Circularity Check
No circularity: empirical observations from new simulation benchmark
The paper presents ChatCLIDS as a simulation framework with expert-validated virtual patients and multi-turn LLM interactions. Reported findings on strategy adaptation and resistance under social pressure are direct outputs of running the benchmark, not reductions to fitted parameters or self-definitional loops. No equations or derivations are claimed; the central contribution is the testbed itself. The virtual patient library is positioned as an input assumption rather than a derived result, with no load-bearing self-citation chains or ansatz smuggling. This is a standard self-contained benchmark paper.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Virtual patients with clinically grounded heterogeneous profiles accurately capture the diversity of real adoption barriers in type 1 diabetes.
- domain assumption: Evidence-based persuasive strategies can be operationalized for LLM agents in multi-turn health dialogues.