The Day My Chatbot Changed: Characterizing the Mental Health Impacts of Social AI App Updates via Negative User Reviews
Pith reviewed 2026-05-10 17:21 UTC · model grok-4.3
The pith
User ratings of the Character AI app fluctuate across updates, with certain versions linked to more complaints about technical failures and psychological effects.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By connecting 210,840 reviews directly to the app versions in use at the time of posting, the analysis finds that ratings vary from one version to the next, with particular releases producing more negative evaluations. Thematic review of the negative comments reveals that dissatisfaction clusters around technical malfunctions and errors, while a smaller group of users explicitly connects these issues to potential psychological or addiction-related effects.
What carries the argument
Version-linked review analysis paired with thematic coding of negative feedback, which ties expressed user concerns to specific software iterations.
If this is right
- Ratings of the chatbot change depending on which version is active at the time of the review.
- Certain releases are tied to noticeably higher volumes of negative evaluations.
- Technical malfunctions and errors account for the bulk of expressed dissatisfaction.
- A subset of users describes the effects of these changes in terms related to psychological well-being or addiction.
- Maintaining stability and offering clear communication during updates supports better user outcomes in social AI systems.
Where Pith is reading between the lines
- The same version-tracking approach could be used to monitor update effects in other chatbots or social apps that users integrate into daily routines.
- Frequent changes may interrupt the consistent interaction patterns that some users develop with these tools.
- App teams could add pre-release checks focused on how updates affect long-term user attachment rather than only short-term functionality.
Load-bearing premise
That negative reviews can be taken as reliable signs of psychological or addiction-related effects and that each review is correctly matched to the precise app version the user actually experienced.
What would settle it
A study that tracks the same users before and after specific updates and finds no measurable shift in their reported mental health, usage habits, or satisfaction levels would undermine the claimed connection between versions and negative feedback patterns.
Original abstract
Artificial Intelligence (AI) chatbots are increasingly used for emotional, creative, and social support, leading to sustained and routine user interaction with these systems. As these applications evolve through frequent version updates, changes in functionality or behavior may influence how users evaluate them. However, work on how publicly expressed user feedback varies across app versions in real-world deployment contexts is limited. This study analyzes 210,840 Google Play reviews of the chatbot application Character AI, linking each review to the app version active at the time of posting. We specifically examine negative reviews to study how version-level rating trends and linguistic patterns reflect user experiences. Our results show that user ratings fluctuate across successive versions, with certain releases associated with stronger negative evaluations. Thematic analysis indicates that dissatisfaction is concentrated around recurring issues related to technical malfunctions and errors. A subset of reviews additionally frames these concerns in terms of potential psychological or addiction-related effects. The findings highlight how aggregate user evaluations and expressed concerns vary across software iterations, provide empirical insight into how update cycles relate to user feedback patterns, and underscore the importance of stability and transparent communication in evolving AI systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes 210,840 Google Play reviews of the Character AI app, linking each review to the active app version at posting time. It focuses on negative reviews to track version-level rating fluctuations and applies thematic analysis, finding dissatisfaction concentrated on technical malfunctions and errors, with a subset of reviews framing issues in terms of psychological or addiction-related effects. The work claims this provides empirical insight into how AI app update cycles relate to user feedback patterns and mental health impacts.
Significance. If the version-review linking and thematic coding prove robust, the study offers a valuable large-scale observational dataset on real-world impacts of iterative changes in social AI systems, highlighting the need for stability and transparent communication during updates. The scale (210k+ reviews) enables detection of version-specific patterns not feasible in smaller studies, providing a reproducible starting point for HCI and AI ethics research on user experience evolution, though causal inferences about mental health remain limited by the observational design.
major comments (3)
- [Methods] Methods section: The procedure for linking reviews to specific app versions is not described in detail (e.g., use of review timestamps vs. reported version strings, handling of update timing windows, or number of reviews excluded due to ambiguity). This is load-bearing for the central claim, as inaccurate attribution would directly undermine the reported version-level rating fluctuations and thematic patterns.
- [Thematic analysis / Results] Thematic analysis / Results: No inter-rater reliability statistics (e.g., Cohen's kappa) or external validation steps are reported for coding the 'psychological or addiction-related effects' theme. Since this subset underpins the mental health impacts characterization in the title, abstract, and conclusions, the absence weakens the interpretation that negative reviews reliably indicate such effects.
- [Results] Results section: The analysis of rating fluctuations across versions lacks statistical tests for significance or controls for confounding factors such as differing review volumes per version. Without these, it is unclear whether observed stronger negative evaluations for certain releases reflect update impacts or simply volume-driven selection effects.
minor comments (2)
- [Abstract] Abstract: The collection time period for the 210,840 reviews and the exact criteria used to filter negative reviews are not stated, which would improve transparency and allow readers to assess potential temporal biases.
- [Discussion] Discussion: The limitations paragraph could more explicitly address self-selection bias in public reviews and the gap between expressed concerns and verified mental health outcomes, to better contextualize the findings.
Simulated Author's Rebuttal
Thank you for the opportunity to respond to the referee's report. We value the constructive criticism provided, which highlights important areas for improving the clarity and rigor of our analysis. Below, we address each major comment in detail and outline the revisions we plan to make to the manuscript.
Point-by-point responses
-
Referee: [Methods] Methods section: The procedure for linking reviews to specific app versions is not described in detail (e.g., use of review timestamps vs. reported version strings, handling of update timing windows, or number of reviews excluded due to ambiguity). This is load-bearing for the central claim, as inaccurate attribution would directly undermine the reported version-level rating fluctuations and thematic patterns.
Authors: We agree with the referee that the version-linking procedure requires more detailed exposition to support the paper's central claims. In the revised manuscript, we will expand the Methods section with a step-by-step description of how reviews were matched to app versions. This will include the use of review timestamps in conjunction with version strings, the definition of update timing windows, and reporting on the proportion of reviews excluded due to data ambiguity. These additions will enhance transparency and allow for better assessment of the robustness of our version-specific findings. revision: yes
-
Referee: [Thematic analysis / Results] Thematic analysis / Results: No inter-rater reliability statistics (e.g., Cohen's kappa) or external validation steps are reported for coding the 'psychological or addiction-related effects' theme. Since this subset underpins the mental health impacts characterization in the title, abstract, and conclusions, the absence weakens the interpretation that negative reviews reliably indicate such effects.
Authors: We recognize the importance of demonstrating reliability in thematic coding, particularly for the psychological effects theme. The analysis involved multiple authors reviewing and discussing the codes to achieve consensus, but we did not formally calculate inter-rater reliability statistics. In the revision, we will provide a more comprehensive description of the thematic analysis process, including the steps taken to ensure consistency, and we will explicitly note the lack of IRR metrics as a limitation. We believe this will address the concern while maintaining the exploratory nature of the qualitative component. revision: partial
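The IRR statistic the referee asks for is straightforward to compute. A self-contained sketch of Cohen's kappa for two coders on a hypothetical double-coded sample, with a binary label (1 = review mentions psychological or addiction-related effects):

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two raters labeling the same items (nominal labels)."""
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    # Observed agreement: fraction of items both coders labeled identically.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected agreement under independent marginal label distributions.
    pa, pb = Counter(coder_a), Counter(coder_b)
    labels = set(coder_a) | set(coder_b)
    expected = sum((pa[l] / n) * (pb[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Hypothetical labels for 10 double-coded negative reviews.
a = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]
b = [1, 1, 0, 0, 0, 0, 0, 1, 0, 1]
print(round(cohens_kappa(a, b), 3))  # → 0.583
```

A kappa around 0.6 on such a sample would indicate only moderate agreement, which is precisely why reporting the statistic (rather than consensus alone) matters for the mental-health coding.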
-
Referee: [Results] Results section: The analysis of rating fluctuations across versions lacks statistical tests for significance or controls for confounding factors such as differing review volumes per version. Without these, it is unclear whether observed stronger negative evaluations for certain releases reflect update impacts or simply volume-driven selection effects.
Authors: We appreciate this suggestion for strengthening the quantitative analysis. The observed fluctuations are presented descriptively in the current version, but we agree that statistical validation is warranted. In the revised Results section, we will incorporate statistical tests to evaluate the significance of rating differences across versions and include controls for review volume as a confounding factor, such as through weighted analyses or regression models. This will help clarify whether the patterns are attributable to the updates themselves. revision: yes
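One simple version of the volume-aware check the authors promise is a two-proportion z-test on the share of negative (1–2 star) reviews between two versions: the pooled standard error scales with each version's review count, so versions with few reviews cannot produce spuriously "significant" spikes. A sketch with illustrative counts (not the paper's data):

```python
from math import sqrt, erf

def two_proportion_z(neg_a, n_a, neg_b, n_b):
    """z statistic and two-sided p-value for a difference in negative-review rates."""
    p_a, p_b = neg_a / n_a, neg_b / n_b
    pooled = (neg_a + neg_b) / (n_a + n_b)
    # Standard error shrinks with review volume, controlling for unequal counts.
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the normal tail, via the error function.
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p

# Illustrative counts: version A (30% negative) vs. version B (20% negative).
z, p = two_proportion_z(neg_a=900, n_a=3000, neg_b=700, n_b=3500)
print(f"z = {z:.2f}, p = {p:.4g}")
```

A regression with version fixed effects and volume weights, as the authors propose, generalizes this pairwise test to all releases at once.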
Circularity Check
No circularity: purely observational empirical analysis with no derivations
full rationale
This paper performs direct data analysis on 210,840 Google Play reviews, links reviews to app versions via timestamps or version strings, computes rating trends across versions, and applies thematic coding to negative reviews to identify recurring themes including technical issues and a subset mentioning psychological effects. No equations, fitted parameters, model predictions, first-principles derivations, or self-referential definitions appear in the abstract or described methods. All claims reduce to empirical counts, trends, and qualitative themes extracted from the review corpus rather than any closed loop where outputs are redefined as inputs. Limitations concern data quality and interpretation bias, not logical circularity in a derivation chain.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Negative user reviews accurately reflect users' experienced technical problems and psychological states after app updates.
Reference graph
Works this paper leans on
- [1] Cathy Mengying Fang, Auren R Liu, Valdemar Danry, Eunhae Lee, Samantha WT Chan, Pat Pataranutaporn, Pattie Maes, Jason Phang, Michael Lampe, Lama Ahmad, et al. 2025. How AI and Human Behaviors Shape Psychosocial Effects of Extended Chatbot Use: A Longitudinal Randomized Controlled Study. arXiv preprint arXiv:2503.17473 (2025).
- [2] Maarten Grootendorst. 2022. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203.05794 (2022).
- [3] Matthew Honnibal, Inês Montani, Sofie Van Landeghem, and Adriane Boyd. 2020. spaCy: Industrial-strength Natural Language Processing in Python. doi:10.5281/zenodo.1212303.
- [4] Pat Pataranutaporn, Sheer Karny, Chayapatr Archiwaranguprok, Constanze Albrecht, Auren R Liu, and Pattie Maes. 2025. "My Boyfriend is AI": A Computational Analysis of Human-AI Companionship in Reddit's AI Community. arXiv preprint arXiv:2509.11391 (2025).
- [5] Nadja Rupprechter and Tobias Dienlin. [n. d.]. It's her! Investigating relationship development with social AI chatbots. ([n. d.]).
- [6] Yla R Tausczik and James W Pennebaker. 2010. The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology 29, 1 (2010), 24–54.
- [7] Hang Xu, Zijun June Shi, and Mengze Shi. 2025. Bonding with AI: Investigating the Love Relationships between Humans and AI Companions. (June 09, 2025).
- [8] Ala Yankouskaya, Areej Babiker, Syeda Rizvi, Sameha Alshakhsi, Magnus Liebherr, and Raian Ali. 2025. LLM-D12: A Dual-Dimensional Scale of Instrumental and Relational Dependencies on Large Language Models. ACM Transactions on the Web (2025).
- [9] Ala Yankouskaya, Magnus Liebherr, and Raian Ali. 2025. Can ChatGPT be addictive? A call to examine the shift from support to dependence in AI conversational large language models. Human-Centric Intelligent Systems (2025), 1–13.