The Day My Chatbot Changed: Characterizing the Mental Health Impacts of Social AI App Updates via Negative User Reviews
Pith reviewed 2026-05-10 17:21 UTC · model grok-4.3
The pith
User ratings of the Character AI app fluctuate across updates, with certain versions linked to more complaints about technical failures and psychological effects.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By connecting 210,840 reviews directly to the app versions in use at the time of posting, the analysis finds that ratings vary from one version to the next, with particular releases producing more negative evaluations. Thematic review of the negative comments reveals that dissatisfaction clusters around technical malfunctions and errors, while a smaller group of users explicitly connects these issues to potential psychological or addiction-related effects.
What carries the argument
Version-linked review analysis paired with thematic coding of negative feedback, which ties expressed user concerns to specific software iterations.
If this is right
- Ratings of the chatbot change depending on which version is active at the time of the review.
- Certain releases are tied to noticeably higher volumes of negative evaluations.
- Technical malfunctions and errors account for the bulk of expressed dissatisfaction.
- A subset of users describes the effects of these changes in terms related to psychological well-being or addiction.
- Maintaining stability and offering clear communication during updates supports better user outcomes in social AI systems.
Where Pith is reading between the lines
- The same version-tracking approach could be used to monitor update effects in other chatbots or social apps that users integrate into daily routines.
- Frequent changes may interrupt the consistent interaction patterns that some users develop with these tools.
- App teams could add pre-release checks focused on how updates affect long-term user attachment rather than only short-term functionality.
Load-bearing premise
That negative reviews can be taken as reliable signs of psychological or addiction-related effects and that each review is correctly matched to the precise app version the user actually experienced.
What would settle it
A study that tracks the same users before and after specific updates and finds no measurable shift in their reported mental health, usage habits, or satisfaction levels would undermine the claimed connection between versions and negative feedback patterns.
Original abstract
Artificial Intelligence (AI) chatbots are increasingly used for emotional, creative, and social support, leading to sustained and routine user interaction with these systems. As these applications evolve through frequent version updates, changes in functionality or behavior may influence how users evaluate them. However, work on how publicly expressed user feedback varies across app versions in real-world deployment contexts is limited. This study analyzes 210,840 Google Play reviews of the chatbot application Character AI, linking each review to the app version active at the time of posting. We specifically examine negative reviews to study how version-level rating trends and linguistic patterns reflect user experiences. Our results show that user ratings fluctuate across successive versions, with certain releases associated with stronger negative evaluations. Thematic analysis indicates that dissatisfaction is concentrated around recurring issues related to technical malfunctions and errors. A subset of reviews additionally frames these concerns in terms of potential psychological or addiction-related effects. The findings highlight how aggregate user evaluations and expressed concerns vary across software iterations, provide empirical insight into how update cycles relate to user feedback patterns, and underscore the importance of stability and transparent communication in evolving AI systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes 210,840 Google Play reviews of the Character AI app, linking each review to the active app version at posting time. It focuses on negative reviews to track version-level rating fluctuations and applies thematic analysis, finding dissatisfaction concentrated on technical malfunctions and errors, with a subset of reviews framing issues in terms of psychological or addiction-related effects. The work claims this provides empirical insight into how AI app update cycles relate to user feedback patterns and mental health impacts.
Significance. If the version-review linking and thematic coding prove robust, the study offers a valuable large-scale observational dataset on real-world impacts of iterative changes in social AI systems, highlighting the need for stability and transparent communication during updates. The scale (210k+ reviews) enables detection of version-specific patterns not feasible in smaller studies, providing a reproducible starting point for HCI and AI ethics research on user experience evolution, though causal inferences about mental health remain limited by the observational design.
major comments (3)
- [Methods] Methods section: The procedure for linking reviews to specific app versions is not described in detail (e.g., use of review timestamps vs. reported version strings, handling of update timing windows, or number of reviews excluded due to ambiguity). This is load-bearing for the central claim, as inaccurate attribution would directly undermine the reported version-level rating fluctuations and thematic patterns.
- [Thematic analysis / Results] Thematic analysis / Results: No inter-rater reliability statistics (e.g., Cohen's kappa) or external validation steps are reported for coding the 'psychological or addiction-related effects' theme. Since this subset underpins the mental health impacts characterization in the title, abstract, and conclusions, the absence weakens the interpretation that negative reviews reliably indicate such effects.
- [Results] Results section: The analysis of rating fluctuations across versions lacks statistical tests for significance or controls for confounding factors such as differing review volumes per version. Without these, it is unclear whether observed stronger negative evaluations for certain releases reflect update impacts or simply volume-driven selection effects.
minor comments (2)
- [Abstract] Abstract: The collection time period for the 210,840 reviews and the exact criteria used to filter negative reviews are not stated, which would improve transparency and allow readers to assess potential temporal biases.
- [Discussion] Discussion: The limitations paragraph could more explicitly address self-selection bias in public reviews and the gap between expressed concerns and verified mental health outcomes, to better contextualize the findings.
Simulated Author's Rebuttal
Thank you for the opportunity to respond to the referee's report. We value the constructive criticism provided, which highlights important areas for improving the clarity and rigor of our analysis. Below, we address each major comment in detail and outline the revisions we plan to make to the manuscript.
Point-by-point responses
-
Referee: [Methods] Methods section: The procedure for linking reviews to specific app versions is not described in detail (e.g., use of review timestamps vs. reported version strings, handling of update timing windows, or number of reviews excluded due to ambiguity). This is load-bearing for the central claim, as inaccurate attribution would directly undermine the reported version-level rating fluctuations and thematic patterns.
Authors: We agree with the referee that the version-linking procedure requires more detailed exposition to support the paper's central claims. In the revised manuscript, we will expand the Methods section with a step-by-step description of how reviews were matched to app versions. This will include the use of review timestamps in conjunction with version strings, the definition of update timing windows, and reporting on the proportion of reviews excluded due to data ambiguity. These additions will enhance transparency and allow for better assessment of the robustness of our version-specific findings. revision: yes
-
Referee: [Thematic analysis / Results] Thematic analysis / Results: No inter-rater reliability statistics (e.g., Cohen's kappa) or external validation steps are reported for coding the 'psychological or addiction-related effects' theme. Since this subset underpins the mental health impacts characterization in the title, abstract, and conclusions, the absence weakens the interpretation that negative reviews reliably indicate such effects.
Authors: We recognize the importance of demonstrating reliability in thematic coding, particularly for the psychological effects theme. The analysis involved multiple authors reviewing and discussing the codes to achieve consensus, but we did not formally calculate inter-rater reliability statistics. In the revision, we will provide a more comprehensive description of the thematic analysis process, including the steps taken to ensure consistency, and we will explicitly note the lack of IRR metrics as a limitation. We believe this will address the concern while maintaining the exploratory nature of the qualitative component. revision: partial
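The IRR statistic the referee asks for is straightforward to compute. A self-contained sketch of Cohen's kappa for two coders on a hypothetical double-coded sample, with a binary label (1 = review mentions psychological or addiction-related effects):

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two raters labeling the same items (nominal labels)."""
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    # Observed agreement: fraction of items both coders labeled identically.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected agreement under independent marginal label distributions.
    pa, pb = Counter(coder_a), Counter(coder_b)
    labels = set(coder_a) | set(coder_b)
    expected = sum((pa[l] / n) * (pb[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Hypothetical labels for 10 double-coded negative reviews.
a = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]
b = [1, 1, 0, 0, 0, 0, 0, 1, 0, 1]
print(round(cohens_kappa(a, b), 3))  # → 0.583
```

A kappa around 0.6 on such a sample would indicate only moderate agreement, which is precisely why reporting the statistic (rather than consensus alone) matters for the mental-health coding.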
-
Referee: [Results] Results section: The analysis of rating fluctuations across versions lacks statistical tests for significance or controls for confounding factors such as differing review volumes per version. Without these, it is unclear whether observed stronger negative evaluations for certain releases reflect update impacts or simply volume-driven selection effects.
Authors: We appreciate this suggestion for strengthening the quantitative analysis. The observed fluctuations are presented descriptively in the current version, but we agree that statistical validation is warranted. In the revised Results section, we will incorporate statistical tests to evaluate the significance of rating differences across versions and include controls for review volume as a confounding factor, such as through weighted analyses or regression models. This will help clarify whether the patterns are attributable to the updates themselves. revision: yes
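One simple version of the volume-aware check the authors promise is a two-proportion z-test on the share of negative (1–2 star) reviews between two versions: the pooled standard error scales with each version's review count, so versions with few reviews cannot produce spuriously "significant" spikes. A sketch with illustrative counts (not the paper's data):

```python
from math import sqrt, erf

def two_proportion_z(neg_a, n_a, neg_b, n_b):
    """z statistic and two-sided p-value for a difference in negative-review rates."""
    p_a, p_b = neg_a / n_a, neg_b / n_b
    pooled = (neg_a + neg_b) / (n_a + n_b)
    # Standard error shrinks with review volume, controlling for unequal counts.
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the normal tail, via the error function.
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p

# Illustrative counts: version A (30% negative) vs. version B (20% negative).
z, p = two_proportion_z(neg_a=900, n_a=3000, neg_b=700, n_b=3500)
print(f"z = {z:.2f}, p = {p:.4g}")
```

A regression with version fixed effects and volume weights, as the authors propose, generalizes this pairwise test to all releases at once.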
Circularity Check
No circularity: purely observational empirical analysis with no derivations
full rationale
This paper performs direct data analysis on 210,840 Google Play reviews, links reviews to app versions via timestamps or version strings, computes rating trends across versions, and applies thematic coding to negative reviews to identify recurring themes including technical issues and a subset mentioning psychological effects. No equations, fitted parameters, model predictions, first-principles derivations, or self-referential definitions appear in the abstract or described methods. All claims reduce to empirical counts, trends, and qualitative themes extracted from the review corpus rather than any closed loop where outputs are redefined as inputs. Limitations concern data quality and interpretation bias, not logical circularity in a derivation chain.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Negative user reviews accurately reflect users' experienced technical problems and psychological states after app updates.
Reference graph
Works this paper leans on
- [1] Cathy Mengying Fang, Auren R Liu, Valdemar Danry, Eunhae Lee, Samantha WT Chan, Pat Pataranutaporn, Pattie Maes, Jason Phang, Michael Lampe, Lama Ahmad, et al. 2025. How AI and Human Behaviors Shape Psychosocial Effects of Extended Chatbot Use: A Longitudinal Randomized Controlled Study. arXiv preprint arXiv:2503.17473 (2025).
- [2] Maarten Grootendorst. 2022. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203.05794 (2022).
- [3] Matthew Honnibal, Inês Montani, Sofie Van Landeghem, and Adriane Boyd. 2020. spaCy: Industrial-strength Natural Language Processing in Python. doi:10.5281/zenodo.1212303.
- [4] Pat Pataranutaporn, Sheer Karny, Chayapatr Archiwaranguprok, Constanze Albrecht, Auren R Liu, and Pattie Maes. 2025. "My Boyfriend is AI": A Computational Analysis of Human-AI Companionship in Reddit's AI Community. arXiv preprint arXiv:2509.11391 (2025).
- [5] Nadja Rupprechter and Tobias Dienlin. [n. d.]. It's her! Investigating relationship development with social AI chatbots. ([n. d.]).
- [6] Yla R Tausczik and James W Pennebaker. 2010. The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology 29, 1 (2010), 24–54.
- [7] Hang Xu, Zijun June Shi, and Mengze Shi. 2025. Bonding with AI: Investigating the Love Relationships between Humans and AI Companions. (June 09, 2025).
- [8] Ala Yankouskaya, Areej Babiker, Syeda Rizvi, Sameha Alshakhsi, Magnus Liebherr, and Raian Ali. 2025. LLM-D12: A Dual-Dimensional Scale of Instrumental and Relational Dependencies on Large Language Models. ACM Transactions on the Web (2025).
- [9] Ala Yankouskaya, Magnus Liebherr, and Raian Ali. 2025. Can ChatGPT be addictive? A call to examine the shift from support to dependence in AI conversational large language models. Human-Centric Intelligent Systems (2025), 1–13.