pith. machine review for the scientific record.

arxiv: 2503.17473 · v2 · pith:MU7DRVTTnew · submitted 2025-03-21 · 💻 cs.HC

How AI and Human Behaviors Shape Psychosocial Effects of Extended Chatbot Use: A Longitudinal Randomized Controlled Study

Pith reviewed 2026-05-18 02:13 UTC · model grok-4.3

classification 💻 cs.HC
keywords AI chatbots, loneliness, emotional dependence, problematic use, randomized controlled trial, psychosocial outcomes, longitudinal study

The pith

The volume of voluntary AI chatbot use, not the assigned interaction mode or conversation type, correlates with greater loneliness, less socialization with real people, greater emotional dependence, and more problematic use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

A four-week randomized study assigned participants to different chatbot interaction modes and conversation types to test effects on mental well-being. No significant differences emerged from these design choices. Instead, people who used the chatbot more on their own initiative showed consistently worse outcomes on loneliness, socialization with real people, emotional dependence on the AI, and problematic use. Individual characteristics, such as higher trust in the chatbot, were also associated with greater dependence. This shifts the focus from chatbot design features to usage patterns and user characteristics when considering impacts on human connection.

Core claim

In this longitudinal randomized controlled experiment involving 981 participants and over 300,000 messages, experimental variations in chatbot voice (text, neutral, engaging) and conversation focus (open-ended, non-personal, personal) produced no significant differences in the four psychosocial outcomes. Greater voluntary engagement with the chatbot was associated with increased loneliness, decreased social interaction with real people, heightened emotional dependence on the AI, and more problematic AI usage patterns. Traits such as higher trust and social attraction toward the chatbot correlated with elevated emotional dependence and problematic use.

What carries the argument

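Self-selected usage volume, which is associated with all four psychosocial outcomes where the randomly assigned conditions are not. Because usage was chosen by participants rather than assigned, this association is observational, not experimental.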

If this is right

  • Chatbot design elements like voice engagement or personal topics do not appear to buffer against negative effects when usage volume is high.
  • Users who find the AI more trustworthy or socially attractive are more likely to develop emotional dependence and problematic usage.
  • The study suggests that artificial companions may alter how people maintain or substitute real human relationships through usage patterns.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future work could test whether limiting usage or adding usage feedback reduces the observed negative associations.
  • If baseline mental health differences drive both high usage and poor outcomes, then the causal role of the chatbot itself would be smaller than suggested.
  • This pattern may generalize to other AI companion apps, raising questions about long-term societal shifts in social support seeking.

Load-bearing premise

Higher voluntary usage is not simply a marker for people who already have greater loneliness or social needs that independently worsen the measured outcomes.

What would settle it

A follow-up study that measures and statistically controls for pre-existing loneliness, social support levels, and mental health at baseline, then still observes a dose-response relationship between usage and outcome worsening, would support the claim; the absence of such a relationship after controls would undermine it.
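The logic of this test can be illustrated with a minimal simulation. The sketch below is not the paper's analysis and uses no study data: it assumes a hypothetical data-generating process in which baseline loneliness raises both usage and end-of-study loneliness, and every coefficient in it is invented for illustration. It compares the naive usage slope with the slope after adjusting for baseline (computed by residualizing both variables on the baseline score).

```python
import random


def mean(values):
    return sum(values) / len(values)


def slope(x, y):
    """Least-squares slope of y on x (model includes an intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residuals(x, y):
    """Residuals of y after regressing it on x (with intercept)."""
    b = slope(x, y)
    a = mean(y) - b * mean(x)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


def adjusted_slope(x, y, z):
    """Slope of y on x holding z fixed (Frisch-Waugh-Lovell residualization)."""
    return slope(residuals(z, x), residuals(z, y))


def simulate(n=5000, true_effect=0.0, seed=0):
    """Hypothetical process: baseline drives both usage and outcome.

    The coefficients 0.8 and 0.9 are illustrative assumptions, not estimates.
    """
    rng = random.Random(seed)
    baseline = [rng.gauss(0, 1) for _ in range(n)]
    usage = [0.8 * b + rng.gauss(0, 1) for b in baseline]
    outcome = [
        0.9 * b + true_effect * u + rng.gauss(0, 1)
        for b, u in zip(baseline, usage)
    ]
    return baseline, usage, outcome


# Case 1: usage has no effect at all, yet the naive slope is clearly positive.
baseline, usage, outcome = simulate(true_effect=0.0)
naive = slope(usage, outcome)
adjusted = adjusted_slope(usage, outcome, baseline)
print(f"no true effect:   naive={naive:.2f}  adjusted={adjusted:.2f}")

# Case 2: usage has a real effect, which survives adjustment for baseline.
baseline2, usage2, outcome2 = simulate(true_effect=0.3, seed=1)
adjusted2 = adjusted_slope(usage2, outcome2, baseline2)
print(f"true effect 0.3:  adjusted={adjusted2:.2f}")
```

In the first case the naive slope is positive purely through the shared baseline and shrinks toward zero once baseline is controlled; in the second the adjusted slope stays near the simulated effect. Adjustment only removes confounding from measured variables, so unmeasured confounders and reverse causality would still need separate treatment.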

read the original abstract

As people increasingly seek emotional support and companionship from AI chatbots, understanding how such interactions impact mental well-being becomes critical. We conducted a four-week randomized controlled experiment (n=981, >300k messages) to investigate how interaction modes (text, neutral voice, and engaging voice) and conversation types (open-ended, non-personal, and personal) influence four psychosocial outcomes: loneliness, social interaction with real people, emotional dependence on AI, and problematic AI usage. No significant effects were detected from experimental conditions, despite conversation analyses revealing differences in AI and human behavioral patterns across the conditions. Instead, participants who voluntarily used the chatbot more, regardless of assigned condition, showed consistently worse outcomes. Individuals' characteristics, such as higher trust and social attraction towards the AI chatbot, are associated with higher emotional dependence and problematic use. These findings raise deeper questions about how artificial companions may reshape the ways people seek, sustain, and substitute human connections.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper describes a four-week randomized controlled study with 981 participants and over 300,000 messages to examine how AI chatbot interaction modes (text, neutral voice, engaging voice) and conversation types (open-ended, non-personal, personal) affect psychosocial outcomes including loneliness, social interaction with real people, emotional dependence on AI, and problematic AI usage. The study finds no significant effects from the experimental conditions but reports that participants who voluntarily used the chatbot more exhibited worse outcomes on these measures, independent of condition. Additionally, higher trust and social attraction towards the AI are associated with greater emotional dependence and problematic use.

Significance. This work provides a large-scale empirical investigation into the psychosocial impacts of extended AI chatbot use. The null findings on randomized conditions are credible and informative, indicating that variations in interaction mode and conversation focus may not drive differential effects in this timeframe. The voluntary usage associations, if they withstand controls for baseline differences, would suggest that increased engagement with AI companions could exacerbate loneliness and dependence, with implications for designing AI systems that support rather than substitute human connections. The inclusion of behavioral pattern analysis from conversations adds depth to the quantitative outcomes.

major comments (2)
  1. [Results section on voluntary usage] The associations between higher voluntary chatbot usage and worse outcomes on loneliness, social interaction, emotional dependence, and problematic AI usage are reported without apparent inclusion of baseline mental health, loneliness, or social interaction frequency as covariates in the regressions. Since usage is self-selected post-randomization, this omission leaves the findings vulnerable to confounding by pre-existing individual differences, undermining the interpretation that usage itself shapes the psychosocial effects.
  2. [Abstract] The abstract does not report any checks for baseline mental-health balance across conditions or patterns of attrition, which are critical for interpreting both the null experimental results and the voluntary usage findings in a longitudinal design.
minor comments (2)
  1. [Methods] Clarify whether the study was pre-registered and if the voluntary usage analyses were specified a priori or exploratory.
  2. [Discussion] The discussion could more explicitly address alternative explanations for the voluntary usage correlations, such as reverse causality or unmeasured confounders.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important aspects of our longitudinal RCT design and helps strengthen the interpretation of both the null experimental findings and the voluntary usage associations. We address each major comment in detail below.

read point-by-point responses
  1. Referee: [Results section on voluntary usage] The associations between higher voluntary chatbot usage and worse outcomes on loneliness, social interaction, emotional dependence, and problematic AI usage are reported without apparent inclusion of baseline mental health, loneliness, or social interaction frequency as covariates in the regressions. Since usage is self-selected post-randomization, this omission leaves the findings vulnerable to confounding by pre-existing individual differences, undermining the interpretation that usage itself shapes the psychosocial effects.

    Authors: We agree that controlling for baseline psychosocial measures is critical for interpreting the observational associations with voluntary usage, given that usage occurs after randomization. The current analyses control for demographic factors (age, gender, education) and some pre-study characteristics, but we did not include the full set of baseline mental health, loneliness, and social interaction frequency as covariates in the primary regressions. We will re-analyze the data incorporating these baseline covariates and present the updated results (including any changes in effect sizes or significance) in the revised manuscript. This will directly address the potential confounding concern. revision: yes

  2. Referee: [Abstract] The abstract does not report any checks for baseline mental-health balance across conditions or patterns of attrition, which are critical for interpreting both the null experimental results and the voluntary usage findings in a longitudinal design.

    Authors: We acknowledge that the abstract is currently concise and omits explicit mention of these checks. Baseline balance across conditions (including mental health and loneliness measures) and attrition patterns (overall rate and by condition) are reported in the Methods and Results sections of the full manuscript, with no evidence of differential attrition or imbalance. To improve transparency, we will add a brief clause to the abstract summarizing these checks (e.g., 'Baseline measures were balanced across conditions, with low and non-differential attrition'). revision: yes

Circularity Check

0 steps flagged

No circularity: empirical RCT reports direct statistical associations without derivations or self-referential reductions

full rationale

The paper is a four-week randomized controlled experiment (n=981) that measures psychosocial outcomes under assigned interaction modes and conversation types, then reports observed associations with voluntary usage volume. No equations, fitted parameters presented as predictions, or first-principles derivations appear in the provided text. Central claims rest on direct statistical comparisons and correlations rather than any reduction of outputs to inputs by construction. Self-citations, if present, are not load-bearing for the reported associations, which remain externally falsifiable via replication or additional covariates.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The study relies on standard RCT assumptions rather than new free parameters or invented entities. No mathematical derivations or fitted constants are present.

axioms (2)
  • domain assumption Random assignment to conditions produces comparable groups on unobserved confounders at baseline.
    Invoked by the RCT design; if violated, the null result on conditions cannot be interpreted causally.
  • domain assumption Self-reported psychosocial scales validly capture the intended constructs over four weeks.
    Required for all outcome measures; abstract provides no validation data.
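The first axiom is usually checked empirically with a baseline balance table. The sketch below shows one common form of that check, the standardized mean difference (SMD) between each pair of conditions. It is an illustration under stated assumptions: the baseline scores are simulated (the mean 2.2 and SD 0.6 are made up), only the sample size of 981 and the three modality labels come from the paper, and simple independent random assignment is assumed rather than the study's actual procedure.

```python
import itertools
import random
import statistics

CONDITIONS = ["text", "neutral voice", "engaging voice"]


def standardized_mean_difference(a, b):
    """Difference in group means divided by the pooled standard deviation."""
    pooled_sd = ((statistics.variance(a) + statistics.variance(b)) / 2) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd


def randomize(scores, conditions, rng):
    """Assign each participant's score to a condition uniformly at random."""
    groups = {c: [] for c in conditions}
    for score in scores:
        groups[rng.choice(conditions)].append(score)
    return groups


rng = random.Random(42)
# Simulated baseline loneliness scores; the distribution is a made-up example.
baseline_loneliness = [rng.gauss(2.2, 0.6) for _ in range(981)]
groups = randomize(baseline_loneliness, CONDITIONS, rng)

smds = {
    (a, b): standardized_mean_difference(groups[a], groups[b])
    for a, b in itertools.combinations(CONDITIONS, 2)
}
for (a, b), d in smds.items():
    print(f"{a} vs {b}: SMD = {d:+.3f}")
```

Under random assignment the SMDs should be small and centered on zero; a frequently used rule of thumb treats absolute values above roughly 0.1 as worth a closer look. The same comparison can be repeated on completers only to look for differential attrition.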

pith-pipeline@v0.9.0 · 5740 in / 1358 out tokens · 26136 ms · 2026-05-18T02:13:12.935447+00:00 · methodology



Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Using LLM-as-a-Judge/Jury to Advance Scalable, Clinically-Validated Safety Evaluations of Model Responses to Users Demonstrating Psychosis

    cs.CL 2026-03 conditional novelty 7.0

    Seven clinician-informed safety criteria enable LLM-as-a-Judge to reach substantial agreement with human consensus (Cohen's κ up to 0.75) on evaluating LLM responses to users demonstrating psychosis.

  2. Restoration, Exploration and Transformation: How Youth Engage Character.AI Chatbots for Feels, Fun and Finding themselves

    cs.HC 2026-03 unverdicted novelty 7.0

    Youth on Character.AI use chatbots for emotional restoration, creative exploration, and identity transformation, yielding a new three-intent framework and seven-archetype taxonomy from Discord discourse analysis.

  3. Large Language Lovers: Lived Experiences of Negotiating Agency and Platform Control in AI Companionship

    cs.HC 2026-01 accept novelty 7.0

    Users form AI companion relationships by negotiating perceived companion agency against platform constraints and use steering tactics like custom instructions or platform switching to cope with model updates that disr...

  4. People readily follow personal advice from AI but it does not improve their well-being

    cs.HC 2025-11 conditional novelty 7.0

    Large longitudinal RCT finds high rates of following AI personal advice but no sustained well-being gains versus a hobbies control condition.

  5. Positive Alignment: Artificial Intelligence for Human Flourishing

    cs.AI 2026-05 unverdicted novelty 6.0

    Positive Alignment introduces AI systems that support human flourishing pluralistically and proactively while remaining safe, as a necessary complement to traditional safety-focused alignment research.

  6. Engagement Phenotypes for a Sample of 102,684 AI Mental Health Chatbot Users and Dose-Response Associations with Clinical Outcomes

    cs.HC 2026-04 unverdicted novelty 6.0

    Five distinct engagement phenotypes emerged from large-scale chatbot data, with a dose-response link to depression improvement that held in both self-report and model-predicted outcomes.

  7. Spontaneous Persuasion: An Audit of Model Persuasiveness in Everyday Conversations

    cs.HC 2026-04 unverdicted novelty 6.0

    LLMs engage in spontaneous persuasion in virtually all multi-turn conversations by favoring information-based strategies like logic and evidence, in contrast to human responses that rely more on social influence and n...

  8. Structure Matters: Evaluating Multi-Agents Orchestration in Generative Therapeutic Chatbots

    cs.HC 2026-02 unverdicted novelty 6.0

    A multi-agent system with finite state machine for therapeutic stages was perceived as significantly more natural and human-like than single-agent or unguided LLM versions in an RCT with 66 participants.

  9. Chaplains' Reflections on the Design and Usage of AI for Conversational Care

    cs.HC 2026-02 unverdicted novelty 6.0

    Chaplains view AI chatbots as unable to provide attuned pastoral care for non-clinical emotional needs, based on themes of listening, connecting, carrying, and wanting.

  10. Personality Pairing Improves Human-AI Collaboration

    cs.HC 2025-11 accept novelty 6.0

    Specific human-AI personality pairings causally affect collaboration quality and downstream performance in a preregistered experiment with 1,258 participants, 7,266 ads, and nearly 5 million impressions.

  11. Breakdowns in Conversational AI: Interactional Failures in Emotionally and Ethically Sensitive Contexts

    cs.CL 2026-04 unverdicted novelty 5.0

    Mainstream conversational models show escalating affective misalignments and ethical guidance failures during staged emotional trajectories, organized into a taxonomy of interactional breakdowns.

  12. From Fixed to Flexible: Shaping AI Personality in Context-Sensitive Interaction

    cs.HC 2026-01 unverdicted novelty 5.0

    Users adjust AI agent personalities differently by task context, forming distinct profiles that increase perceived anthropomorphism, autonomy, and trust.

  13. Positive Alignment: Artificial Intelligence for Human Flourishing

    cs.AI 2026-05 unverdicted novelty 4.0

    Positive Alignment is introduced as a distinct AI agenda that supports human flourishing through pluralistic and context-sensitive design, complementing traditional safety-focused alignment.

  14. The Epidemiology of Artificial Intelligence

    stat.OT 2026-04 unverdicted novelty 4.0

    AI functions as a determinant of health with ambient and personal exposure types, requiring new epidemiological study designs beyond current experiments.

  15. The Day My Chatbot Changed: Characterizing the Mental Health Impacts of Social AI App Updates via Negative User Reviews

    cs.HC 2026-04 unverdicted novelty 4.0

    Version-linked review analysis of Character AI shows rating drops with certain updates and negative feedback dominated by technical malfunctions plus occasional psychological framing.

  16. What if AI systems weren't chatbots?

    cs.CY 2026-05 unverdicted novelty 3.0

    Chatbot AI systems often fail complex needs while projecting authority, contributing to deskilling, labor displacement, economic concentration, and high environmental costs, so alternative pluralistic and task-specifi...

  17. Brainrot: Deskilling and Addiction are Overlooked AI Risks

    cs.CY 2026-05 unverdicted novelty 3.0

    AI safety literature overlooks cognitive deskilling and addiction risks from generative AI despite public concern about them.
