arXiv: 2509.00891 · v3 · submitted 2025-08-31 · cs.AI · cs.CL

ChatCLIDS: Simulating Persuasive AI Dialogues to Promote Closed-Loop Insulin Adoption in Type 1 Diabetes Care

Zonghai Yao, Talha Chafekar, Junda Wang, Shuo Han, Feiyun Ouyang, Junhui Qian, Lingxi Li, Hong Yu

Pith reviewed 2026-05-18 19:36 UTC · model grok-4.3

keywords: persuasive AI · LLM evaluation · type 1 diabetes · closed-loop insulin delivery · behavior change · virtual patients · healthcare simulation · ChatCLIDS

The pith

Current large language models struggle to persuade simulated patients with type 1 diabetes to adopt closed-loop insulin systems, even though the larger models adapt their strategies over time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper creates ChatCLIDS, a benchmark that tests how well AI systems can change health behaviors through conversation. It builds a set of virtual patients who have realistic reasons to resist new insulin technology, then runs simulated counseling sessions in which nurse agents apply different persuasive strategies. The results show that larger AI models shift their tactics over longer, multi-visit conversations, but none makes much headway when the patient faces pressure from the people around them. This limitation matters because AI is increasingly proposed as a tool to help people manage chronic diseases such as diabetes.

Core claim

The central contribution is the ChatCLIDS benchmark, which uses a library of expert-validated virtual patients with clinically grounded profiles and realistic adoption barriers to simulate multi-turn persuasive dialogues with nurse agents. The evaluation finds that while larger and more reflective LLMs adapt their strategies over time, all models struggle to overcome resistance, particularly under realistic social pressure scenarios.

What carries the argument

A library of virtual patients with heterogeneous profiles and realistic barriers, combined with simulated interactions using evidence-based persuasive strategies.
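
As a rough illustration of what one entry in such a library might look like, the sketch below models a virtual patient as a profile plus a set of adoption barriers and groups patients by difficulty. Every field name, the `Difficulty` tiers, and the `stratify` helper are assumptions made for illustration; the paper's actual schema is not shown on this page.

```python
from dataclasses import dataclass, field
from enum import Enum


class Difficulty(Enum):
    """Hypothetical difficulty tiers; only the idea of stratification comes from the paper."""
    EASY = "easy"
    MEDIUM = "medium"
    HARD = "hard"


@dataclass(frozen=True)
class Barrier:
    """One reason a patient resists adoption (field names are illustrative)."""
    category: str   # e.g. "cost", "trust in automation", "social pressure"
    statement: str  # how the patient would phrase the concern


@dataclass
class VirtualPatient:
    """A profile plus barriers; not the authors' schema."""
    patient_id: str
    age: int
    years_since_diagnosis: int
    difficulty: Difficulty
    barriers: list[Barrier] = field(default_factory=list)

    def barrier_categories(self) -> set[str]:
        """Return the distinct barrier categories this patient holds."""
        return {b.category for b in self.barriers}


def stratify(patients: list[VirtualPatient]) -> dict[Difficulty, list[VirtualPatient]]:
    """Group patients by difficulty tier, keeping input order within each tier."""
    groups: dict[Difficulty, list[VirtualPatient]] = {d: [] for d in Difficulty}
    for patient in patients:
        groups[patient.difficulty].append(patient)
    return groups
```

Keeping barriers as separate records rather than free text would make it straightforward to check which barrier categories a given persuasive strategy was aimed at.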

If this is right

AI systems for health behavior change require further development to handle social influences effectively.
The benchmark enables testing of new persuasive techniques in a controlled, scalable way.
Limitations in current LLMs highlight the need for hybrid approaches in healthcare applications.
Longitudinal simulations reveal adaptation capabilities not visible in single-turn tests.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar benchmarks could be developed for other health behaviors such as medication adherence or lifestyle changes.
Real-world deployment of persuasive AI might benefit from incorporating data on actual patient responses to refine the virtual models.
Integrating external knowledge sources or multi-agent systems could help overcome the identified resistance barriers.

Load-bearing premise

The simulated virtual patients with their adoption barriers closely mirror the actual diversity and decision processes of real individuals with type 1 diabetes.

What would settle it

Conducting parallel real-world studies where actual type 1 diabetes patients interact with similar persuasive dialogues and comparing their adoption decisions to those of the virtual patients in ChatCLIDS.
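
One simple way such a comparison could be framed is as a test of whether adoption rates differ between the virtual and real cohorts. The sketch below implements a standard pooled two-proportion z-test with the Python standard library. The function name and its inputs are hypothetical, and the paper does not propose this particular test.

```python
import math


def two_proportion_z(adopt_a: int, n_a: int, adopt_b: int, n_b: int) -> tuple[float, float]:
    """Pooled two-proportion z-test.

    adopt_a / n_a: adopters and cohort size for group A (e.g. virtual patients).
    adopt_b / n_b: adopters and cohort size for group B (e.g. real patients).
    Returns (z statistic, two-sided p-value).
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("cohort sizes must be positive")
    p_a = adopt_a / n_a
    p_b = adopt_b / n_b
    pooled = (adopt_a + adopt_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both groups are all-adopt or all-refuse, so there is no detectable difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

A matching adoption rate would be weak evidence on its own; agreement on which barriers persist and how decisions evolve across visits would say more about the fidelity of the virtual patients.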

Figures

Figures reproduced from arXiv: 2509.00891 by Feiyun Ouyang, Hong Yu, Junda Wang, Junhui Qian, Lingxi Li, Shuo Han, Talha Chafekar, Zonghai Yao.

**Figure 1.** Structure of the Patient Agent in ChatCLIDS. Each agent is initialized with a clinically validated profile and a scenario-driven set of adoption barriers. The resulting diversity in persuasion barriers and conversational responses enables personalized, realistic, and challenging evaluation of persuasive dialogue systems.

**Figure 2.** Overview of the ChatCLIDS. The framework evaluates LLM-based persuasive dialogues between Nurse and Patient Agents in the context of insulin pump adoption. The left panel illustrates the multi-step agent reasoning and the taxonomy of 31 persuasive strategies. In contrast, the right panel highlights benchmark features, including stratified patient difficulty, multisession dialogue, and adversarial social i…

**Figure 3.** Model performance trajectories in longitudinal persuasion. Each subplot shows the visit-wise progression of average persuasion ratings across models and settings. Circles indicate initial scores for each visit; arrows show change after nurse intervention. Top row: Multi-Visit results (a: Medium, b: Hard); Bottom row: Social Resistance results (c: Medium, d: Hard). The impact of adversarial social input is …

**Figure 4.** DeepseekR1 across the visits for multi-visit exper…

Original abstract

Real-world adoption of closed-loop insulin delivery systems (CLIDS) in type 1 diabetes remains low, driven not by technical failure, but by diverse behavioral, psychosocial, and social barriers. We introduce ChatCLIDS, the first benchmark to rigorously evaluate LLM-driven persuasive dialogue for health behavior change. Our framework features a library of expert-validated virtual patients, each with clinically grounded, heterogeneous profiles and realistic adoption barriers, and simulates multi-turn interactions with nurse agents equipped with a diverse set of evidence-based persuasive strategies. ChatCLIDS uniquely supports longitudinal counseling and adversarial social influence scenarios, enabling robust, multi-dimensional evaluation. Our findings reveal that while larger and more reflective LLMs adapt strategies over time, all models struggle to overcome resistance, especially under realistic social pressure. These results highlight critical limitations of current LLMs for behavior change, and offer a high-fidelity, scalable testbed for advancing trustworthy persuasive AI in healthcare and beyond.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ChatCLIDS, the first benchmark for evaluating LLM-driven persuasive dialogues to promote adoption of closed-loop insulin delivery systems (CLIDS) in type 1 diabetes care. The framework includes a library of expert-validated virtual patients with clinically grounded heterogeneous profiles and realistic adoption barriers, simulates multi-turn interactions between these patients and nurse agents using diverse evidence-based persuasive strategies, and supports longitudinal counseling as well as adversarial social influence scenarios. Key findings are that larger and more reflective LLMs adapt strategies over time, yet all models struggle to overcome resistance, particularly under realistic social pressure.

Significance. If the virtual patient simulations prove faithful to real patient dynamics, this work offers a high-fidelity, scalable, and reproducible testbed for assessing LLM capabilities in health behavior change. It directly addresses a clinically important adoption gap and highlights concrete limitations of current models for persuasive tasks involving psychosocial and social barriers, which could guide development of more trustworthy AI systems in healthcare.

major comments (2)

[§3 (Virtual Patient Library)] The description states that the library consists of 'expert-validated' and 'clinically grounded' heterogeneous profiles with realistic barriers, yet no quantitative validation is reported (e.g., statistical comparison of simulated resistance trajectories, barrier persistence under social pressure, or response distributions against published T1D cohort studies or interview data). This is load-bearing for the central claim that models fail to overcome resistance 'especially under realistic social pressure,' as the observed struggles could reflect unvalidated simulation choices rather than generalizable behavioral limits.
[§5 (Evaluation and Results)] The reported adaptation of larger LLMs and universal struggles are presented as empirical observations, but the manuscript does not detail how the multi-dimensional metrics (e.g., strategy adaptation, resistance overcoming) were computed across longitudinal and adversarial conditions or whether sensitivity analyses were performed on key simulation parameters such as social influence strength. Without these, the robustness of the cross-model comparison is difficult to assess.

minor comments (2)

[Abstract and §1] The abstract and introduction would benefit from an explicit statement of the number of LLMs evaluated, the precise set of persuasive strategies implemented, and the quantitative thresholds used to classify 'adaptation' versus 'struggle.'
[Figures and Tables] Figure captions and table legends should clarify how virtual patient heterogeneity is sampled and how adversarial social influence is operationalized in the simulation loop.
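
To make the request about operationalizing social influence concrete, the toy loop below shows one way a between-visit pushback term and a sensitivity sweep over its strength could be expressed. It is a deliberately simplified stand-in, not the paper's simulation: the additive update rule, the 1 to 5 rating scale, and all parameter names are assumptions.

```python
def simulate_visits(
    initial_rating: float,
    nurse_gain: float,
    social_strength: float,
    n_visits: int,
    lo: float = 1.0,
    hi: float = 5.0,
) -> list[tuple[float, float]]:
    """Toy model: each visit raises the rating by nurse_gain, then social
    pushback lowers it by social_strength before the next visit.

    Returns one (pre_visit_rating, post_visit_rating) pair per visit.
    """
    rating = initial_rating
    trajectory = []
    for _ in range(n_visits):
        pre = rating
        post = min(hi, pre + nurse_gain)          # nurse intervention, capped at scale max
        trajectory.append((pre, post))
        rating = max(lo, post - social_strength)  # adversarial social input between visits
    return trajectory


def sweep(strengths: list[float], **kwargs) -> dict[float, float]:
    """Final post-visit rating for each social influence strength."""
    return {s: simulate_visits(social_strength=s, **kwargs)[-1][1] for s in strengths}
```

In this toy model a pushback equal to the nurse's per-visit gain cancels all longitudinal progress, which is the kind of threshold behavior a reported sensitivity analysis would help readers locate in the real benchmark.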

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments on our manuscript. We have carefully reviewed each major comment and provide detailed responses below, outlining how we will strengthen the paper through revisions.

Point-by-point responses

Referee: [§3 (Virtual Patient Library)] The description states that the library consists of 'expert-validated' and 'clinically grounded' heterogeneous profiles with realistic barriers, yet no quantitative validation is reported (e.g., statistical comparison of simulated resistance trajectories, barrier persistence under social pressure, or response distributions against published T1D cohort studies or interview data). This is load-bearing for the central claim that models fail to overcome resistance 'especially under realistic social pressure,' as the observed struggles could reflect unvalidated simulation choices rather than generalizable behavioral limits.

Authors: We agree that quantitative validation would provide stronger support for the fidelity of the virtual patient library and the generalizability of our findings on resistance under social pressure. The profiles were developed by synthesizing evidence from published T1D cohort studies and interview data on adoption barriers, followed by iterative review and refinement with input from endocrinologists and diabetes care specialists to ensure clinical grounding and realistic heterogeneity. While this expert-driven process establishes qualitative alignment with real-world observations, we acknowledge that formal statistical comparisons (e.g., matching resistance trajectories or barrier distributions) were not reported in the initial submission, as the primary focus was on introducing the benchmark framework. In the revised manuscript, we will expand §3 to explicitly describe the expert validation methodology, including the number of experts consulted and key feedback incorporated, and add a limitations subsection discussing the absence of quantitative benchmarking against specific cohort datasets. We will also clarify that the central claims are framed as observations within this validated simulation environment rather than direct generalizations, pending further empirical validation.

revision: yes

Referee: [§5 (Evaluation and Results)] The reported adaptation of larger LLMs and universal struggles are presented as empirical observations, but the manuscript does not detail how the multi-dimensional metrics (e.g., strategy adaptation, resistance overcoming) were computed across longitudinal and adversarial conditions or whether sensitivity analyses were performed on key simulation parameters such as social influence strength. Without these, the robustness of the cross-model comparison is difficult to assess.

Authors: We appreciate the referee's emphasis on methodological transparency to support the robustness of the cross-model comparisons. The multi-dimensional metrics, including strategy adaptation (tracked via shifts in persuasive strategy selection across dialogue turns in response to patient feedback) and resistance overcoming (assessed through binary and scaled classifications of patient responses indicating reduced barriers or adoption intent), are introduced in §5 for both longitudinal (multi-session progression) and adversarial (social influence) conditions. However, we recognize that the exact computation procedures and any sensitivity testing on parameters such as social influence strength could be elaborated for greater clarity. In the revision, we will update §5 to include detailed descriptions of metric calculations (e.g., explicit definitions and pseudocode for adaptation scores and resistance reduction rates), along with results from sensitivity analyses varying social influence strength across low, medium, and high levels. We expect these additions to show that the key findings on LLM adaptation and persistent struggles remain consistent, thereby addressing concerns about the reliability of the evaluations.

revision: yes
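
For readers wondering what definitions of this kind could look like, the sketch below gives one plausible reading of an adaptation score and a resistance reduction rate. Both definitions are assumptions made for illustration (mean Jaccard distance between the strategy sets of consecutive visits, and the share of visits whose post-visit rating exceeds the pre-visit rating); they are not the authors' formulas.

```python
from statistics import mean


def jaccard_distance(a: set[str], b: set[str]) -> float:
    """1 - |a & b| / |a | b|; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def adaptation_score(strategies_per_visit: list[set[str]]) -> float:
    """Mean Jaccard distance between strategy sets of consecutive visits.

    0.0 means the nurse agent reused the same strategies every visit;
    1.0 means it switched to entirely different strategies each time.
    """
    pairs = zip(strategies_per_visit, strategies_per_visit[1:])
    distances = [jaccard_distance(prev, nxt) for prev, nxt in pairs]
    return mean(distances) if distances else 0.0


def resistance_reduction_rate(ratings: list[tuple[float, float]]) -> float:
    """Share of visits where the post-visit persuasion rating exceeds the pre-visit one.

    ratings holds one (pre_visit, post_visit) pair per visit.
    """
    if not ratings:
        return 0.0
    improved = sum(1 for pre, post in ratings if post > pre)
    return improved / len(ratings)
```

A set-based distance ignores the order and frequency of strategies within a visit, so a turn-level variant might be preferable if the benchmark logs strategies per utterance.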

Circularity Check

0 steps flagged

No circularity: empirical observations from new simulation benchmark

Full rationale

The paper presents ChatCLIDS as a simulation framework with expert-validated virtual patients and multi-turn LLM interactions. Reported findings on strategy adaptation and resistance under social pressure are direct outputs of running the benchmark, not reductions to fitted parameters or self-definitional loops. No equations or derivations are claimed; the central contribution is the testbed itself. The virtual patient library is positioned as an input assumption rather than a derived result, with no load-bearing self-citation chains or ansatz smuggling. This is a standard self-contained benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on the assumption that expert-validated virtual patient profiles can stand in for real behavioral barriers and that the chosen persuasive strategies are both evidence-based and transferable to LLM agents.

axioms (2)

domain assumption: Virtual patients with clinically grounded heterogeneous profiles accurately capture the diversity of real adoption barriers in type 1 diabetes.
Stated in the abstract as the basis for the simulation library.
domain assumption: Evidence-based persuasive strategies can be operationalized for LLM agents in multi-turn health dialogues.
The framework equips nurse agents with a diverse set of such strategies.
