arxiv: 2503.17473 · v2 · pith:MU7DRVTTnew · submitted 2025-03-21 · 💻 cs.HC

How AI and Human Behaviors Shape Psychosocial Effects of Extended Chatbot Use: A Longitudinal Randomized Controlled Study

Cathy Mengying Fang, Auren R. Liu, Valdemar Danry, Eunhae Lee, Samantha W.T. Chan, Pat Pataranutaporn, Pattie Maes, Jason Phang, Michael Lampe, Lama Ahmad, Sandhini Agarwal

Pith reviewed 2026-05-18 02:13 UTC · model grok-4.3

classification 💻 cs.HC

keywords AI chatbots · loneliness · emotional dependence · problematic use · randomized controlled trial · psychosocial outcomes · longitudinal study

The pith

The volume of voluntary AI chatbot use, not assigned interaction modes, correlates with worse loneliness, social withdrawal, emotional dependence, and problematic usage.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

A four-week randomized study assigned participants to different chatbot interaction modes and conversation topics to test effects on mental well-being. No significant differences in outcomes emerged from these design choices. Instead, people who used the chatbot more on their own initiative showed consistently poorer outcomes across loneliness, real-life social interaction, emotional reliance on the AI, and signs of problematic use. Individual factors such as trust in the chatbot also predicted greater dependence. This shifts the focus from chatbot design features to usage patterns and user characteristics when considering impacts on human connections.

Core claim

In this longitudinal randomized controlled experiment involving 981 participants and over 300,000 messages, experimental variations in interaction mode (text, neutral voice, engaging voice) and conversation type (open-ended, non-personal, personal) produced no significant differences in the four psychosocial outcomes. Greater voluntary engagement with the chatbot was associated with increased loneliness, decreased social interaction with real people, heightened emotional dependence on the AI, and more problematic AI usage. Traits such as higher trust in and social attraction toward the chatbot correlated with elevated emotional dependence and problematic use.

What carries the argument

The self-selected amount of chatbot use, which predicts psychosocial outcomes where the assigned experimental conditions do not.

If this is right

Chatbot design elements like voice engagement or personal topics do not appear to buffer against negative effects when usage volume is high.
Users who find the AI more trustworthy or socially attractive are more likely to develop emotional dependence and problematic usage.
The study suggests that artificial companions may alter, through patterns of use, how people maintain or substitute for real human relationships.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Future work could test whether limiting usage or adding usage feedback reduces the observed negative associations.
If baseline mental health differences drive both high usage and poor outcomes, then the causal role of the chatbot itself would be smaller than suggested.
This pattern may generalize to other AI companion apps, raising questions about long-term societal shifts in social support seeking.

Load-bearing premise

Higher voluntary usage is not simply a marker for people who already have greater loneliness or social needs that independently worsen the measured outcomes.

What would settle it

A follow-up study that measures and statistically controls for pre-existing loneliness, social support levels, and mental health at baseline, then still observes a dose-response relationship between usage and outcome worsening, would support the claim; the absence of such a relationship after controls would undermine it.
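The selection concern behind this test can be made concrete with a small simulation. The sketch below uses synthetic data only, not the study's data: a hypothetical baseline loneliness score drives both usage and the follow-up outcome, while usage itself has no true effect. The function names (`simulate`, `slope`, `residuals`) and the coefficients 0.8 and 0.7 are illustrative assumptions. The adjusted estimate is obtained by residualizing both usage and outcome on baseline, which gives the same usage coefficient as a regression that includes baseline as a covariate.

```python
import random

def slope(x, y):
    """Ordinary least-squares slope of y on x (intercept included)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residuals(x, y):
    """What is left of y after removing its linear dependence on x."""
    b = slope(x, y)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]

def simulate(n=5000, seed=0):
    """Synthetic participants; all coefficients are assumptions."""
    rng = random.Random(seed)
    baseline = [rng.gauss(0, 1) for _ in range(n)]
    # Lonelier participants use the chatbot more (assumed selection).
    usage = [0.8 * b + rng.gauss(0, 1) for b in baseline]
    # Follow-up outcome depends on baseline only: true usage effect is 0.
    outcome = [0.7 * b + rng.gauss(0, 1) for b in baseline]
    return baseline, usage, outcome

baseline, usage, outcome = simulate()

# Naive association between usage and outcome, ignoring baseline.
unadjusted = slope(usage, outcome)

# Usage coefficient after controlling for baseline loneliness.
adjusted = slope(residuals(baseline, usage), residuals(baseline, outcome))

print(f"unadjusted slope: {unadjusted:.3f}")
print(f"adjusted slope:   {adjusted:.3f}")
```

In this toy setup the unadjusted slope is clearly positive even though usage does nothing, and it shrinks toward zero once baseline is controlled. A real dose-response effect would instead survive the adjustment, which is the pattern the proposed follow-up would need to show.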

Original abstract

As people increasingly seek emotional support and companionship from AI chatbots, understanding how such interactions impact mental well-being becomes critical. We conducted a four-week randomized controlled experiment (n=981, >300k messages) to investigate how interaction modes (text, neutral voice, and engaging voice) and conversation types (open-ended, non-personal, and personal) influence four psychosocial outcomes: loneliness, social interaction with real people, emotional dependence on AI, and problematic AI usage. No significant effects were detected from experimental conditions, despite conversation analyses revealing differences in AI and human behavioral patterns across the conditions. Instead, participants who voluntarily used the chatbot more, regardless of assigned condition, showed consistently worse outcomes. Individuals' characteristics, such as higher trust and social attraction towards the AI chatbot, are associated with higher emotional dependence and problematic use. These findings raise deeper questions about how artificial companions may reshape the ways people seek, sustain, and substitute human connections.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The RCT produced clean nulls on the assigned conditions but the voluntary-usage correlations with worse outcomes rest on post-randomization self-selection that the abstract leaves unaddressed.

Hi, the main takeaway is that assigning people to text versus voice or personal versus non-personal chats made no detectable difference in loneliness, real-world social contact, emotional dependence on the AI, or problematic usage after four weeks. What did track with worse scores on all four measures was how much participants chose to use the chatbot on their own, and higher trust or social attraction to the AI also lined up with more dependence and problematic use. They logged over 300k messages, which is useful raw material.

The experiment itself is one of the larger longitudinal RCTs in this corner of the literature, with a sample of 981 and three interaction modes crossed with three conversation types. That scale and the null results on the randomized arms are the parts that feel solid given the numbers. The conversation analysis showing behavioral differences across conditions is a nice addition that most smaller studies skip.

The voluntary-usage finding is the softer spot. Usage volume is chosen after randomization, so any baseline differences in social needs or mental health that also drive both heavier use and poorer trajectories will produce exactly this pattern. The abstract does not report whether the regressions included baseline loneliness or social-interaction covariates, nor does it flag attrition checks or pre-registration for the usage analysis. That leaves the headline positive result compatible with pure selection.

This is the sort of paper researchers working on AI companions or digital mental health will want to read for the scale and the nulls, even if they treat the usage links as suggestive rather than causal. I would send it for peer review. The experiment is substantial enough that referees can sort out the observational controls and tighten what can be claimed from the voluntary data.

Referee Report

2 major / 2 minor

Summary. The paper describes a four-week randomized controlled study with 981 participants and over 300,000 messages to examine how AI chatbot interaction modes (text, neutral voice, engaging voice) and conversation types (open-ended, non-personal, personal) affect psychosocial outcomes including loneliness, social interaction with real people, emotional dependence on AI, and problematic AI usage. The study finds no significant effects from the experimental conditions but reports that participants who voluntarily used the chatbot more exhibited worse outcomes on these measures, independent of condition. Additionally, higher trust and social attraction towards the AI are associated with greater emotional dependence and problematic use.

Significance. This work provides a large-scale empirical investigation into the psychosocial impacts of extended AI chatbot use. The null findings on randomized conditions are credible and informative, indicating that variations in interaction mode and conversation focus may not drive differential effects in this timeframe. The voluntary usage associations, if they withstand controls for baseline differences, would suggest that increased engagement with AI companions could exacerbate loneliness and dependence, with implications for designing AI systems that support rather than substitute human connections. The inclusion of behavioral pattern analysis from conversations adds depth to the quantitative outcomes.

major comments (2)

[Results section on voluntary usage] The associations between higher voluntary chatbot usage and worse outcomes on loneliness, social interaction, emotional dependence, and problematic AI usage are reported without apparent inclusion of baseline mental health, loneliness, or social interaction frequency as covariates in the regressions. Since usage is self-selected post-randomization, this omission leaves the findings vulnerable to confounding by pre-existing individual differences, undermining the interpretation that usage itself shapes the psychosocial effects.
[Abstract] The abstract does not report any checks for baseline mental-health balance across conditions or patterns of attrition, which are critical for interpreting both the null experimental results and the voluntary usage findings in a longitudinal design.

minor comments (2)

[Methods] Clarify whether the study was pre-registered and if the voluntary usage analyses were specified a priori or exploratory.
[Discussion] The discussion could more explicitly address alternative explanations for the voluntary usage correlations, such as reverse causality or unmeasured confounders.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important aspects of our longitudinal RCT design and helps strengthen the interpretation of both the null experimental findings and the voluntary usage associations. We address each major comment in detail below.

Referee: [Results section on voluntary usage] The associations between higher voluntary chatbot usage and worse outcomes on loneliness, social interaction, emotional dependence, and problematic AI usage are reported without apparent inclusion of baseline mental health, loneliness, or social interaction frequency as covariates in the regressions. Since usage is self-selected post-randomization, this omission leaves the findings vulnerable to confounding by pre-existing individual differences, undermining the interpretation that usage itself shapes the psychosocial effects.

Authors: We agree that controlling for baseline psychosocial measures is critical for interpreting the observational associations with voluntary usage, given that usage occurs after randomization. The current analyses control for demographic factors (age, gender, education) and some pre-study characteristics, but we did not include the full set of baseline mental health, loneliness, and social interaction frequency as covariates in the primary regressions. We will re-analyze the data incorporating these baseline covariates and present the updated results (including any changes in effect sizes or significance) in the revised manuscript. This will directly address the potential confounding concern.
revision: yes

Referee: [Abstract] The abstract does not report any checks for baseline mental-health balance across conditions or patterns of attrition, which are critical for interpreting both the null experimental results and the voluntary usage findings in a longitudinal design.

Authors: We acknowledge that the abstract is currently concise and omits explicit mention of these checks. Baseline balance across conditions (including mental health and loneliness measures) and attrition patterns (overall rate and by condition) are reported in the Methods and Results sections of the full manuscript, with no evidence of differential attrition or imbalance. To improve transparency, we will add a brief clause to the abstract summarizing these checks (e.g., 'Baseline measures were balanced across conditions, with low and non-differential attrition').
revision: yes
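
A baseline balance and attrition check of the kind discussed in this exchange might look like the following sketch. The scores and enrollment counts are made-up placeholders, not values from the paper, and the standardized mean difference (SMD) is one common balance metric, not necessarily the one the authors used. The helper names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def standardized_mean_difference(a, b):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = sqrt((variance(a) + variance(b)) / 2)
    return (mean(a) - mean(b)) / pooled_sd

def attrition_rate(enrolled, completed):
    """Share of enrolled participants who did not complete the study."""
    return (enrolled - completed) / enrolled

# Hypothetical baseline loneliness scores for two arms (placeholders).
text_arm = [1.0, 2.0, 3.0, 4.0, 5.0]
voice_arm = [2.0, 3.0, 4.0, 5.0, 6.0]
smd = standardized_mean_difference(text_arm, voice_arm)

# Hypothetical (enrolled, completed) counts per arm (placeholders).
arms = {
    "text": (100, 90),
    "neutral voice": (100, 88),
    "engaging voice": (100, 91),
}
rates = {name: attrition_rate(e, c) for name, (e, c) in arms.items()}

# Gap between the highest and lowest dropout rate across arms.
spread = max(rates.values()) - min(rates.values())

print(f"SMD = {smd:.3f}, attrition spread = {spread:.3f}")
```

A commonly cited rule of thumb treats an absolute SMD below roughly 0.1 as adequate balance, though the threshold is a convention rather than a test. A small attrition spread across arms is what "non-differential attrition" would look like in these terms.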

Circularity Check

0 steps flagged

No circularity: empirical RCT reports direct statistical associations without derivations or self-referential reductions

The paper is a four-week randomized controlled experiment (n=981) that measures psychosocial outcomes under assigned interaction modes and conversation types, then reports observed associations with voluntary usage volume. No equations, fitted parameters presented as predictions, or first-principles derivations appear in the provided text. Central claims rest on direct statistical comparisons and correlations rather than any reduction of outputs to inputs by construction. Self-citations, if present, are not load-bearing for the reported associations, which remain externally falsifiable via replication or additional covariates.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The study relies on standard RCT assumptions rather than new free parameters or invented entities. No mathematical derivations or fitted constants are present.

axioms (2)

domain assumption: Random assignment to conditions produces comparable groups on unobserved confounders at baseline.
Invoked by the RCT design; if violated, the null result on conditions cannot be interpreted causally.
domain assumption: Self-reported psychosocial scales validly capture the intended constructs over four weeks.
Required for all outcome measures; abstract provides no validation data.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

No significant effects were detected from experimental conditions... participants who voluntarily used the chatbot more... showed consistently worse outcomes.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Using LLM-as-a-Judge/Jury to Advance Scalable, Clinically-Validated Safety Evaluations of Model Responses to Users Demonstrating Psychosis
cs.CL 2026-03 conditional novelty 7.0

Seven clinician-informed safety criteria enable LLM-as-a-Judge to reach substantial agreement with human consensus (Cohen's κ up to 0.75) on evaluating LLM responses to users demonstrating psychosis.
Restoration, Exploration and Transformation: How Youth Engage Character.AI Chatbots for Feels, Fun and Finding themselves
cs.HC 2026-03 unverdicted novelty 7.0

Youth on Character.AI use chatbots for emotional restoration, creative exploration, and identity transformation, yielding a new three-intent framework and seven-archetype taxonomy from Discord discourse analysis.
Large Language Lovers: Lived Experiences of Negotiating Agency and Platform Control in AI Companionship
cs.HC 2026-01 accept novelty 7.0

Users form AI companion relationships by negotiating perceived companion agency against platform constraints and use steering tactics like custom instructions or platform switching to cope with model updates that disr...
People readily follow personal advice from AI but it does not improve their well-being
cs.HC 2025-11 conditional novelty 7.0

Large longitudinal RCT finds high rates of following AI personal advice but no sustained well-being gains versus a hobbies control condition.
Positive Alignment: Artificial Intelligence for Human Flourishing
cs.AI 2026-05 unverdicted novelty 6.0

Positive Alignment introduces AI systems that support human flourishing pluralistically and proactively while remaining safe, as a necessary complement to traditional safety-focused alignment research.
Engagement Phenotypes for a Sample of 102,684 AI Mental Health Chatbot Users and Dose-Response Associations with Clinical Outcomes
cs.HC 2026-04 unverdicted novelty 6.0

Five distinct engagement phenotypes emerged from large-scale chatbot data, with a dose-response link to depression improvement that held in both self-report and model-predicted outcomes.
Spontaneous Persuasion: An Audit of Model Persuasiveness in Everyday Conversations
cs.HC 2026-04 unverdicted novelty 6.0

LLMs engage in spontaneous persuasion in virtually all multi-turn conversations by favoring information-based strategies like logic and evidence, in contrast to human responses that rely more on social influence and n...
Structure Matters: Evaluating Multi-Agents Orchestration in Generative Therapeutic Chatbots
cs.HC 2026-02 unverdicted novelty 6.0

A multi-agent system with finite state machine for therapeutic stages was perceived as significantly more natural and human-like than single-agent or unguided LLM versions in an RCT with 66 participants.
Chaplains' Reflections on the Design and Usage of AI for Conversational Care
cs.HC 2026-02 unverdicted novelty 6.0

Chaplains view AI chatbots as unable to provide attuned pastoral care for non-clinical emotional needs, based on themes of listening, connecting, carrying, and wanting.
Personality Pairing Improves Human-AI Collaboration
cs.HC 2025-11 accept novelty 6.0

Specific human-AI personality pairings causally affect collaboration quality and downstream performance in a preregistered experiment with 1,258 participants, 7,266 ads, and nearly 5 million impressions.
Breakdowns in Conversational AI: Interactional Failures in Emotionally and Ethically Sensitive Contexts
cs.CL 2026-04 unverdicted novelty 5.0

Mainstream conversational models show escalating affective misalignments and ethical guidance failures during staged emotional trajectories, organized into a taxonomy of interactional breakdowns.
From Fixed to Flexible: Shaping AI Personality in Context-Sensitive Interaction
cs.HC 2026-01 unverdicted novelty 5.0

Users adjust AI agent personalities differently by task context, forming distinct profiles that increase perceived anthropomorphism, autonomy, and trust.
Positive Alignment: Artificial Intelligence for Human Flourishing
cs.AI 2026-05 unverdicted novelty 4.0

Positive Alignment is introduced as a distinct AI agenda that supports human flourishing through pluralistic and context-sensitive design, complementing traditional safety-focused alignment.
The Epidemiology of Artificial Intelligence
stat.OT 2026-04 unverdicted novelty 4.0

AI functions as a determinant of health with ambient and personal exposure types, requiring new epidemiological study designs beyond current experiments.
The Day My Chatbot Changed: Characterizing the Mental Health Impacts of Social AI App Updates via Negative User Reviews
cs.HC 2026-04 unverdicted novelty 4.0

Version-linked review analysis of Character AI shows rating drops with certain updates and negative feedback dominated by technical malfunctions plus occasional psychological framing.
What if AI systems weren't chatbots?
cs.CY 2026-05 unverdicted novelty 3.0

Chatbot AI systems often fail complex needs while projecting authority, contributing to deskilling, labor displacement, economic concentration, and high environmental costs, so alternative pluralistic and task-specifi...
Brainrot: Deskilling and Addiction are Overlooked AI Risks
cs.CY 2026-05 unverdicted novelty 3.0

AI safety literature overlooks cognitive deskilling and addiction risks from generative AI despite public concern about them.
