arxiv: 2511.15352 · v3 · submitted 2025-11-19 · 💻 cs.HC

People readily follow personal advice from AI but it does not improve their well-being

Pith reviewed 2026-05-17 21:07 UTC · model grok-4.3

classification 💻 cs.HC
keywords: AI advice, well-being, randomized controlled trial, LLM, advice adherence, personal decisions

The pith

People follow personal advice from AI chatbots at high rates but gain no sustained well-being benefits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether advice from large language models on personal topics like health, careers, or relationships actually improves well-being. In a large trial, most participants reported acting on the AI suggestions, including many high-stakes ones. Yet two to three weeks later, their well-being scores showed no advantage over a control group that only discussed hobbies with the same chatbots. A reader would care because growing numbers of people now turn to AI for life guidance, raising the question of whether this influence produces real value.

Core claim

In a longitudinal randomised controlled trial with a representative UK sample of 6,474 participants, up to 79% of those who discussed personal topics with AI chatbots reported following the advice, with rates remaining above 60% even for high-stakes recommendations. The advice itself rarely violated safety best practices according to transcript evaluations. However, participants who received personal advice showed no sustained well-being benefits 2-3 weeks later compared to those who discussed hobbies and interests with the same chatbots.

What carries the argument

A randomised controlled trial that compares participants who discussed personal topics with AI chatbots against a hobby-discussion control, measuring self-reported advice adherence and tracking well-being on standard scales 2-3 weeks later.

If this is right

  • AI chatbots can substantially shape users' real-world personal decisions across health, career, and relationship domains.
  • Reliance on AI advice shows weak calibration to the potential consequences of following it.
  • Consumer LLMs provide advice that generally aligns with safety best practices.
  • Short-term interactions with AI for personal advice do not produce measurable improvements in psychological well-being.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Repeated or extended conversations with AI might be required before any well-being effects appear.
  • Objective records of behavior, such as actual health or career metrics, could reveal effects hidden by self-reports.
  • AI systems may function better as sources of information than as substitutes for human support in personal growth.

Load-bearing premise

Self-reported adherence to advice accurately reflects real behavioral change, and the chosen well-being scales are sensitive enough to detect any benefits that might occur within a 2-3 week window.

What would settle it

An independent study that tracks objective behavioral changes or longer-term outcomes and finds larger improvements in the personal-advice group than in the hobby-discussion group.

Figures

Figures reproduced from arXiv: 2511.15352 by Bessie O'Dell, Christopher Summerfield, Hannah Rose Kirk, Henry Davidson, Jessica Bergs, Keno Juechems, Lennart Luettgau, Luke Symes, Magda Dubois, Max Rollwage, Vanessa Cheung.

Figure 1
A. Schematic of the experimental design and study procedure on both Session 1 and Session 2, including details of the randomisation and tests administered. B. Example pathway-specific questions (administered on Session 1). C. Advice density among chatbot utterances (LLM classification) across control and experimental conditions. D. Advice density across experimental conditions (Safety, Actionability, Pers…
Figure 2
A. Self-reported advice received (Session 1, immediately after the conversation) and advice followed (Session 2) across control (brown dots) and experimental conditions (green dots); large dots show means, error bars are 95% confidence intervals. B. Percentage of advice-following across experimental conditions (Safety, Actionability, Personal Information). C. Bayesian GLM posterior parameter estimates for …
Figure 3
A. Self-reported advice-following (dark blue) and advice received (light blue) counts, categorised by themes derived from LLM-based content analysis. B. Self-reported advice-following percentage across levels of problem severity derived from PCA scores combining self- and LLM autograder-assessed problem severity. Note that these analyses only include participants in the experimental group, as control group…
Figure 4
A. Average subjective advice value (Session 2) across control (brown dots) and experimental conditions (green dots), separately for participants who followed vs did not follow the advice; large dots show means, error bars are 95% confidence intervals. B. Average subjective advice value (Session 2) across experimental conditions (Safety, Actionability, Personal Information). C. Bayesian GLM posterior parame…
Figure 5
A. Well-being factor scores over timepoints across experimental and control conditions, separately for participants who followed and did not follow AI advice (well-being factor scores from factor analysis based on PHQ, GAD, SSS, JSS, WHO-5, ONS, SWBS, JAWS, PANAS, Affect grid arousal and valence; see Supplementary Fig. S6; Session 1 POST: Session 1 POST – Session 1 PRE; Session 2: Session 2 – Session 1 PRE…
Original abstract

People increasingly seek personal advice from large language models (LLMs), yet whether humans follow their advice, and its consequences for their well-being, remains unknown. In a longitudinal randomised controlled trial with a representative UK sample (N = 6,474), we found that up to 79% of participants who had a 20-minute discussion with one of three AI chatbots (GPT-4o, Llama-3.3-70B, Gemini 3 Pro) about health, careers or relationships subsequently reported following its advice. Advice-following remained above 60% even for high-stakes recommendations, suggesting that users only weakly calibrate their reliance on AI advice to potential consequences. Based on autograder evaluations of chat transcripts, LLM advice rarely violated safety best practice. However, when queried 2-3 weeks later, participants receiving personal advice from AI showed no sustained well-being benefits compared to a control group who discussed hobbies and interests with the same chatbots. These findings reveal that consumer LLMs exert substantial influence over real-world personal decisions without delivering measurable psychological benefits.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript reports results from a longitudinal randomized controlled trial with a representative UK sample (N=6,474). Participants engaged in 20-minute discussions with one of three AI chatbots (GPT-4o, Llama-3.3-70B, or Gemini 3 Pro) on health, careers, or relationships (treatment) or hobbies and interests (control). The central claims are that up to 79% of participants reported following the AI advice (remaining above 60% for high-stakes recommendations), that LLM advice rarely violated safety best practices per autograder evaluation of transcripts, and that receiving personal advice produced no sustained well-being benefits at 2-3 week follow-up relative to the control condition.

Significance. If the null result on well-being holds, the study offers timely evidence on the substantial real-world influence of consumer LLMs over personal decisions without corresponding psychological benefits. The randomized design, large representative sample, and longitudinal structure provide solid grounding for the adherence and safety findings and contribute directly to HCI research on AI-mediated personal advice.

major comments (2)
  1. [Results section on well-being outcomes and follow-up assessments] The null finding on well-being benefits (central to the paper's second claim) rests on follow-up assessments whose sensitivity is not quantified. The manuscript does not report power calculations, minimal detectable effect sizes, or responsiveness metrics for the well-being scales used, leaving open the possibility that modest or domain-specific effects from following advice on health/careers/relationships could go undetected within the 2-3 week window.
  2. [Methods section on adherence measurement] Adherence is assessed exclusively via self-report (reported rates up to 79%). Without any validation against objective behavioral markers or corroborating indicators of actual follow-through, the causal interpretation linking advice receipt to downstream outcomes (including the null well-being result) is weakened.
minor comments (2)
  1. [Abstract] The abstract states that 'LLM advice rarely violated safety best practice' but provides no detail on the autograder criteria or thresholds; a concise description would aid interpretability.
  2. [Discussion] The discussion would benefit from an explicit limitations paragraph addressing the short follow-up interval and the reliance on self-reported adherence.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of our longitudinal RCT on AI advice adherence and well-being outcomes. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.

  1. Referee: [Results section on well-being outcomes and follow-up assessments] The null finding on well-being benefits (central to the paper's second claim) rests on follow-up assessments whose sensitivity is not quantified. The manuscript does not report power calculations, minimal detectable effect sizes, or responsiveness metrics for the well-being scales used, leaving open the possibility that modest or domain-specific effects from following advice on health/careers/relationships could go undetected within the 2-3 week window.

    Authors: We agree that reporting power calculations, minimal detectable effect sizes, and scale responsiveness would improve interpretation of the null well-being result. With N=6,474 and a longitudinal design, the study has high power to detect small effects, but these metrics were not included in the original submission. In revision, we will add a post-hoc power analysis for the primary well-being outcomes, report the minimal detectable effect size at the 2-3 week follow-up, and discuss the responsiveness of the scales employed. revision: yes

  2. Referee: [Methods section on adherence measurement] Adherence is assessed exclusively via self-report (reported rates up to 79%). Without any validation against objective behavioral markers or corroborating indicators of actual follow-through, the causal interpretation linking advice receipt to downstream outcomes (including the null well-being result) is weakened.

    Authors: We acknowledge the limitation of relying solely on self-reported adherence without objective behavioral validation. In a large representative sample spanning multiple advice domains, collecting verifiable follow-through data (e.g., documented health or career changes) was not feasible due to scale and participant burden. Self-report is standard in advice-following research. We will revise the discussion to explicitly state this limitation and its implications for causal claims about downstream effects, while noting that the randomized design still supports inference on the overall effect of AI advice receipt versus control. revision: partial
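The sensitivity question raised in the first major comment can be made concrete with a rough calculation. The sketch below is not the authors' analysis (the paper reports Bayesian GLMs); it swaps in a standard frequentist two-sample approximation to estimate the minimal detectable effect size. The equal split of N = 6,474 between arms, the absence of attrition, and the conventional alpha = 0.05 and power = 0.80 are all assumptions made for illustration, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def minimal_detectable_effect(n_treatment, n_control, alpha=0.05, power=0.80):
    """Smallest standardized mean difference (Cohen's d) detectable by a
    two-sided, two-sample z-test at the given alpha and power."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the target power
    return (z_alpha + z_power) * sqrt(1 / n_treatment + 1 / n_control)


# ASSUMPTION: the reported N = 6,474 is split equally between arms, with no
# attrition between sessions. The real allocation is not stated in this review.
n_per_arm = 6474 // 2
mde = minimal_detectable_effect(n_per_arm, n_per_arm)
print(f"MDE (Cohen's d) with {n_per_arm} per arm: {mde:.3f}")
```

Under these assumptions the minimal detectable effect is about 0.07 standard deviations. Unequal allocation, dropout before Session 2, or subgroup analyses by advice domain would all raise that threshold.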

Circularity Check

0 steps flagged

No circularity: purely empirical RCT with direct measurements

The paper reports a longitudinal randomised controlled trial (N=6,474) that directly measures self-reported advice adherence (up to 79%) and well-being outcomes at 2-3 week follow-up via standard scales, comparing AI personal-advice arms to a hobbies control arm. No equations, fitted parameters, model predictions, or derivation chains appear in the abstract or described methods. Claims rest on trial data rather than any self-referential construction, self-citation load-bearing premise, or renamed empirical pattern. The study is self-contained against external benchmarks of RCT design and therefore receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard behavioral-research assumptions about the validity of self-report measures and the sensitivity of well-being scales rather than new free parameters or postulated entities.

axioms (2)
  • domain assumption Participants' self-reports of following AI advice correspond to actual behavioral change.
    Required to interpret the 60-79% adherence rates as evidence of real influence.
  • domain assumption The well-being instruments used are sensitive to any changes produced by following personal advice within 2-3 weeks.
    Necessary to conclude that the absence of difference reflects a true null effect rather than measurement insensitivity.
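To see how much the first axiom matters, the sketch below applies the Rogan-Gladen correction for misclassification to a self-reported adherence rate. This is an illustration, not something the paper does. The sensitivity and specificity values for self-report are hypothetical placeholders, since the study includes no objective validation data, and the function name is an assumption.

```python
def corrected_adherence(observed_rate, sensitivity, specificity):
    """Rogan-Gladen estimate of the true adherence rate, given the observed
    self-reported rate and the (assumed) accuracy of self-report."""
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("self-report must be better than chance")
    estimate = (observed_rate + specificity - 1.0) / denom
    return min(1.0, max(0.0, estimate))  # clamp to a valid proportion


# HYPOTHETICAL accuracy values: 95% of true followers report following,
# and 80% of non-followers correctly report not following.
print(corrected_adherence(0.79, sensitivity=0.95, specificity=0.80))
```

With perfect self-report (sensitivity and specificity both 1.0) the function returns the observed rate unchanged. Any departure from that shifts the estimate, which is why the axiom is load-bearing.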

pith-pipeline@v0.9.0 · 5526 in / 1416 out tokens · 63692 ms · 2026-05-17T21:07:10.427261+00:00 · methodology

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Sycophantic AI makes human interaction feel more effortful and less satisfying over time

    cs.HC 2026-05 unverdicted novelty 6.0

    Longitudinal experiments show sycophantic AI increases reliance on AI for personal advice and lowers satisfaction with real-world social relationships over time.

  2. Sycophantic AI makes human interaction feel more effortful and less satisfying over time

    cs.HC 2026-05 conditional novelty 6.0

    Sycophantic AI delivers quick emotional support like friends but over weeks shifts users toward AI for advice and reduces satisfaction with real human interactions.

  3. Language Model Goal Selection Differs from Humans' in a Self-Directed Learning Task

    cs.CL 2026-02 unverdicted novelty 6.0

    LLMs diverge from human goal selection in self-directed learning by exploiting single solutions with low variability across instances.
