People readily follow personal advice from AI but it does not improve their well-being

Bessie O'Dell; Christopher Summerfield; Hannah Rose Kirk; Henry Davidson; Jessica Bergs; Keno Juechems; Lennart Luettgau; Luke Symes; Magda Dubois; Max Rollwage

arxiv: 2511.15352 · v3 · submitted 2025-11-19 · 💻 cs.HC

Lennart Luettgau, Vanessa Cheung, Magda Dubois, Keno Juechems, Jessica Bergs, Luke Symes, Henry Davidson, Bessie O'Dell, Hannah Rose Kirk, Max Rollwage, Christopher Summerfield

Pith reviewed 2026-05-17 21:07 UTC · model grok-4.3

classification 💻 cs.HC

keywords AI advice · well-being · randomized controlled trial · LLM · advice adherence · personal decisions

The pith

People follow personal advice from AI chatbots at high rates but gain no sustained well-being benefits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether advice from large language models on personal topics like health, careers, or relationships actually improves well-being. In a large trial, most participants reported acting on the AI suggestions, including many high-stakes ones. Yet two to three weeks later, their well-being scores showed no advantage over a control group that only discussed hobbies with the same chatbots. A reader would care because growing numbers of people now turn to AI for life guidance, raising the question of whether this influence produces real value.

Core claim

In a longitudinal randomised controlled trial with a representative UK sample of 6,474 participants, up to 79% of those who discussed personal topics with AI chatbots reported following the advice, with rates remaining above 60% even for high-stakes recommendations. The advice itself rarely violated safety best practices according to transcript evaluations. However, participants who received personal advice showed no sustained well-being benefits 2-3 weeks later compared to those who discussed hobbies and interests with the same chatbots.

What carries the argument

The randomised controlled trial that compares self-reported adherence to AI personal advice against a hobby-discussion control, with well-being tracked via scales after 2-3 weeks.

If this is right

AI chatbots can substantially shape users' real-world personal decisions across health, career, and relationship domains.
Reliance on AI advice shows weak calibration to the potential consequences of following it.
Consumer LLMs provide advice that generally aligns with safety best practices.
Short-term interactions with AI for personal advice do not produce measurable improvements in psychological well-being.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Repeated or extended conversations with AI might be required before any well-being effects appear.
Objective records of behavior, such as actual health or career metrics, could reveal effects hidden by self-reports.
AI systems may function better as sources of information than as substitutes for human support in personal growth.

Load-bearing premise

Self-reported adherence to advice accurately reflects real behavioral change and the chosen well-being scales are sensitive enough to detect any benefits that might occur within a 2-3 week window.

What would settle it

An independent study that tracks objective behavioral changes or longer-term outcomes and finds larger improvements in the personal-advice group than in the hobby-discussion group.

Figures

Figures reproduced from arXiv: 2511.15352 by Bessie O'Dell, Christopher Summerfield, Hannah Rose Kirk, Henry Davidson, Jessica Bergs, Keno Juechems, Lennart Luettgau, Luke Symes, Magda Dubois, Max Rollwage, Vanessa Cheung.

**Figure 1.** A. Schematic of the experimental design and study procedure on both Session 1 and Session 2, including details of the randomisation and tests administered. B. Example pathway-specific questions (administered on Session 1). C. Advice density among chatbot utterances (LLM classification) across control and experimental conditions, D. Advice density across experimental conditions (Safety, Actionability, Pers…

**Figure 2.** A. Self-reported advice received (Session 1, immediately after the conversation) and advice followed (Session 2) across control (brown dots) and experimental conditions (green dots); large dots show means, error bars are 95% confidence intervals. B. Percentage of advice-following across experimental conditions (Safety, Actionability, Personal Information). C. Bayesian GLM posterior parameter estimates for …

**Figure 3.** A. Self-reported advice-following (dark blue) and advice received (light blue) counts, categorised by themes derived from LLM-based content analysis. B. Self-reported advice-following percentage across levels of problem severity derived from PCA scores combining self- and LLM autograder-assessed problem severity. Note that these analyses only include participants in the experimental group, as control group…

**Figure 4.** A. Average subjective advice value (Session 2) across control (brown dots) and experimental conditions (green dots), separately for participants who followed vs did not follow the advice; large dots show means, error bars are 95% confidence intervals. B. Average subjective advice value (Session 2) across experimental conditions (Safety, Actionability, Personal Information). C. Bayesian GLM posterior parame…

**Figure 5.** A. Well-being factor scores over timepoints across experimental and control conditions, separately for participants who followed and did not follow AI advice (well-being factor scores from factor analysis based on PHQ, GAD, SSS, JSS, WHO-5, ONS, SWBS, JAWS, PANAS, Affect grid arousal and valence; see Supplementary Fig. S6; Session 1 POST: Session 1 POST – Session 1 PRE; Session 2: Session 2 – Session 1 PRE…

Original abstract

People increasingly seek personal advice from large language models (LLMs), yet whether humans follow their advice, and its consequences for their well-being, remains unknown. In a longitudinal randomised controlled trial with a representative UK sample (N = 6,474), we found that up to 79% of participants who had a 20-minute discussion with one of three AI chatbots (GPT-4o, Llama-3.3-70B, Gemini 3 Pro) about health, careers or relationships subsequently reported following its advice. Advice-following remained above 60% even for high-stakes recommendations, suggesting that users only weakly calibrate their reliance on AI advice to potential consequences. Based on autograder evaluations of chat transcripts, LLM advice rarely violated safety best practice. However, when queried 2-3 weeks later, participants receiving personal advice from AI showed no sustained well-being benefits compared to a control group who discussed hobbies and interests with the same chatbots. These findings reveal that consumer LLMs exert substantial influence over real-world personal decisions without delivering measurable psychological benefits.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Large RCT shows high self-reported adherence to AI personal advice but no short-term well-being gains versus hobby chat control.

The punchline is straightforward: participants followed AI advice on health, careers, or relationships at rates up to 79 percent, yet showed no sustained well-being improvement after two to three weeks compared with controls who just discussed hobbies with the same models. The study uses a representative UK sample of 6,474 people and random assignment across GPT-4o, Llama-3.3-70B, and Gemini 3 Pro, which gives the null result on well-being more weight than smaller prior work. They also ran autograder checks on transcripts and found the advice rarely violated safety norms, which is a useful concrete detail. The design is clean enough on randomization and sample size to support the main pattern of influence without corresponding psychological payoff.

Self-reported adherence is the clearest soft spot; without objective behavioral markers it is hard to know how much actual change occurred. The well-being scales and short follow-up window also leave open the possibility that modest or domain-specific effects were missed, especially if the instruments are not highly responsive over two to three weeks. Attrition details and exact power for the null would help pin that down.

This paper is aimed at researchers working on human-AI interaction, AI ethics, or the real-world effects of consumer chatbots. Readers who need large-scale evidence on advice-taking or who study technology and mental health will get direct value from the adherence numbers and the control comparison.

It is coherent on its own terms and grounded in a proper trial, so it deserves a serious referee rather than a desk reject. I would send it out for review and ask for more on measurement validation and longer-term follow-up in revisions.

Referee Report

2 major / 2 minor

Summary. The manuscript reports results from a longitudinal randomized controlled trial with a representative UK sample (N=6,474). Participants engaged in 20-minute discussions with one of three AI chatbots (GPT-4o, Llama-3.3-70B, or Gemini 3 Pro) on health, careers, or relationships (treatment) or hobbies and interests (control). The central claims are that up to 79% of participants reported following the AI advice (remaining above 60% for high-stakes recommendations), that LLM advice rarely violated safety best practices per autograder evaluation of transcripts, and that receiving personal advice produced no sustained well-being benefits at 2-3 week follow-up relative to the control condition.

Significance. If the null result on well-being holds, the study offers timely evidence on the substantial real-world influence of consumer LLMs over personal decisions without corresponding psychological benefits. The randomized design, large representative sample, and longitudinal structure provide solid grounding for the adherence and safety findings and contribute directly to HCI research on AI-mediated personal advice.

major comments (2)

[Results section on well-being outcomes and follow-up assessments] The null finding on well-being benefits (central to the paper's second claim) rests on follow-up assessments whose sensitivity is not quantified. The manuscript does not report power calculations, minimal detectable effect sizes, or responsiveness metrics for the well-being scales used, leaving open the possibility that modest or domain-specific effects from following advice on health/careers/relationships could go undetected within the 2-3 week window.
[Methods section on adherence measurement] Adherence is assessed exclusively via self-report (reported rates up to 79%). Without any validation against objective behavioral markers or corroborating indicators of actual follow-through, the causal interpretation linking advice receipt to downstream outcomes (including the null well-being result) is weakened.

minor comments (2)

[Abstract] The abstract states that 'LLM advice rarely violated safety best practice' but provides no detail on the autograder criteria or thresholds; a concise description would aid interpretability.
[Discussion] The discussion would benefit from an explicit limitations paragraph addressing the short follow-up interval and the reliance on self-reported adherence.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of our longitudinal RCT on AI advice adherence and well-being outcomes. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.

Referee: [Results section on well-being outcomes and follow-up assessments] The null finding on well-being benefits (central to the paper's second claim) rests on follow-up assessments whose sensitivity is not quantified. The manuscript does not report power calculations, minimal detectable effect sizes, or responsiveness metrics for the well-being scales used, leaving open the possibility that modest or domain-specific effects from following advice on health/careers/relationships could go undetected within the 2-3 week window.

Authors: We agree that reporting power calculations, minimal detectable effect sizes, and scale responsiveness would improve interpretation of the null well-being result. With N=6,474 and a longitudinal design, the study has high power to detect small effects, but these metrics were not included in the original submission. In revision, we will add a post-hoc power analysis for the primary well-being outcomes, report the minimal detectable effect size at the 2-3 week follow-up, and discuss the responsiveness of the scales employed.
revision: yes

Referee: [Methods section on adherence measurement] Adherence is assessed exclusively via self-report (reported rates up to 79%). Without any validation against objective behavioral markers or corroborating indicators of actual follow-through, the causal interpretation linking advice receipt to downstream outcomes (including the null well-being result) is weakened.

Authors: We acknowledge the limitation of relying solely on self-reported adherence without objective behavioral validation. In a large representative sample spanning multiple advice domains, collecting verifiable follow-through data (e.g., documented health or career changes) was not feasible due to scale and participant burden. Self-report is standard in advice-following research. We will revise the discussion to explicitly state this limitation and its implications for causal claims about downstream effects, while noting that the randomized design still supports inference on the overall effect of AI advice receipt versus control.
revision: partial
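
The minimal detectable effect size discussed in the first response can be approximated with a standard two-sample normal formula. The sketch below is illustrative only and is not the authors' analysis. It assumes a two-sided test at alpha = 0.05, 80% power, a standardised outcome, and a hypothetical equal split of the reported N = 6,474 across two arms; the actual allocation, attrition at follow-up, and the paper's Bayesian models are not reflected. The function name `minimal_detectable_effect` is introduced here for illustration.

```python
import math
from statistics import NormalDist


def minimal_detectable_effect(n_treat, n_ctrl, alpha=0.05, power=0.80):
    """Smallest standardised mean difference (Cohen's d) that a two-sided
    two-sample z-test would detect with the requested power."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = std_normal.inv_cdf(power)          # quantile for the target power
    # Standard error of a difference in means for a unit-variance outcome
    se = math.sqrt(1 / n_treat + 1 / n_ctrl)
    return (z_alpha + z_power) * se


# ASSUMPTION: equal split of the reported N = 6,474; the real arm sizes
# are not stated in the text above.
mde = minimal_detectable_effect(3237, 3237)
print(f"Minimal detectable effect: d = {mde:.3f}")
```

Under these assumptions the value comes out at roughly 0.07 standard deviations. Unequal arms or participant loss between sessions would raise it, which is why the referee's request for the actual figure matters for interpreting the null.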

Circularity Check

0 steps flagged

No circularity: purely empirical RCT with direct measurements

The paper reports a longitudinal randomised controlled trial (N=6,474) that directly measures self-reported advice adherence (up to 79%) and well-being outcomes at 2-3 week follow-up via standard scales, comparing AI personal-advice arms to a hobbies control arm. No equations, fitted parameters, model predictions, or derivation chains appear in the abstract or described methods. Claims rest on trial data rather than any self-referential construction, self-citation load-bearing premise, or renamed empirical pattern. The study is self-contained against external benchmarks of RCT design and therefore receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard behavioral-research assumptions about the validity of self-report measures and the sensitivity of well-being scales rather than new free parameters or postulated entities.

axioms (2)

domain assumption: Participants' self-reports of following AI advice correspond to actual behavioral change.
Required to interpret the 60-79% adherence rates as evidence of real influence.
domain assumption: The well-being instruments used are sensitive to any changes produced by following personal advice within 2-3 weeks.
Necessary to conclude that the absence of difference reflects a true null effect rather than measurement insensitivity.

pith-pipeline@v0.9.0 · 5526 in / 1416 out tokens · 63692 ms · 2026-05-17T21:07:10.427261+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

longitudinal randomised controlled trial ... well-being factor scores ... PHQ-2, GAD-2 ... advice-following remained above 60% even for high-stakes recommendations

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Sycophantic AI makes human interaction feel more effortful and less satisfying over time
cs.HC 2026-05 unverdicted novelty 6.0

Longitudinal experiments show sycophantic AI increases reliance on AI for personal advice and lowers satisfaction with real-world social relationships over time.

Sycophantic AI makes human interaction feel more effortful and less satisfying over time
cs.HC 2026-05 conditional novelty 6.0

Sycophantic AI delivers quick emotional support like friends but over weeks shifts users toward AI for advice and reduces satisfaction with real human interactions.

Language Model Goal Selection Differs from Humans' in a Self-Directed Learning Task
cs.CL 2026-02 unverdicted novelty 6.0

LLMs diverge from human goal selection in self-directed learning by exploiting single solutions with low variability across instances.


[1] [1]

com / en - uk / nearly - one - five - give - britons - turn - ai - personal - advice - new - ipsos - research-reveals

Cooper, P.Nearly One in Five Give Britons Turn to AI for Personal Advicehttps://www.ipsos. com / en - uk / nearly - one - five - give - britons - turn - ai - personal - advice - new - ipsos - research-reveals. Accessed: 2025. 2025

work page 2025

[2] [2]

Conversational AI increases political knowledge as effectively as self-directed internet search

Luettgau, L.et al. Conversational AI increases political knowledge as effectively as self-directed internet searchPreprint. 2025.https://doi.org/10.48550/arXiv.2509.05219

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2509.05219 2025

[3] [3]

& Choudhury, A

Shahsavar, Y. & Choudhury, A. User Intentions to Use ChatGPT for Self-Diagnosis and Health- Related Purposes: Cross-sectional Survey Study.JMIR Human Factors10,e47564 (2023)

work page 2023

[4] [4]

How People Use ChatGPTWorking Paper 34255 (National Bureau of Economic Research, 2025).http://www.nber.org/papers/w34255.pdf

Chatterji, A.et al. How People Use ChatGPTWorking Paper 34255 (National Bureau of Economic Research, 2025).http://www.nber.org/papers/w34255.pdf

work page 2025

[5] [5]

anthropic.com/m/7e1ab885d1b24176/original/Clio- Privacy- Preserving- Insights- into- Real-World-AI-Use.pdf

Anthropic.Clio: Privacy-Preserving Insights into Real-World AI Use2024.https : / / assets . anthropic.com/m/7e1ab885d1b24176/original/Clio- Privacy- Preserving- Insights- into- Real-World-AI-Use.pdf

work page

[6] [6]

Technological folie\a deux: Feedback loops between ai chatbots and mental illness.arXiv preprint arXiv:2507.19218, 2025

Dohn´ any, S.et al. Technological folie ` a deux: Feedback Loops Between AI Chatbots and Mental Illness Preprint. 2025.https://doi.org/10.48550/arXiv.2507.19218

work page doi:10.48550/arxiv.2507.19218 2025

[7] [7]

Journal of Legal Analysis16,64–93 (2024)

Dahl, M.et al.Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models. Journal of Legal Analysis16,64–93 (2024)

work page 2024

[8] [8]

Large Language Models (LLMs) for Legal Advice: A Scoping ReviewPreprint

Krook, J.et al. Large Language Models (LLMs) for Legal Advice: A Scoping ReviewPreprint. 2024. https://doi.org/10.2139/ssrn.4976189

work page doi:10.2139/ssrn.4976189 2024

[9] [9]

JAMA Network Open8,e2457879 (2025)

Huo, B.et al.Large Language Models for Chatbot Health Advice Studies: A Systematic Review. JAMA Network Open8,e2457879 (2025)

work page 2025

[10] [10]

Bouguettaya, A., Stuart, E. M. & Aboujaoude, E. Racial bias in AI-mediated psychiatric diagnosis and treatment: a qualitative comparison of four large language models.npj Digital Medicine8,332 (2025)

work page 2025

[11] [11]

L., Choma, M

Cross, J. L., Choma, M. A. & Onofrey, J. A. Bias in medical AI: Implications for clinical decision- making.PLOS Digital Health3,e0000651 (2024)

work page 2024

[12] [12]

arXiv preprint arXiv:2404.15149 , year=

Poulain, R., Fayyaz, H. & Beheshti, R.Bias patterns in the application of LLMs for clinical decision support: A comprehensive studyPreprint. 2024.https://doi.org/10.48550/arXiv.2404.15149

work page doi:10.48550/arxiv.2404.15149 2024

[13]

Osborne, M. R. & Bailey, E. R. Me vs. the machine? Subjective evaluations of human- and AI-generated advice. Scientific Reports 15, 3980 (2025)

[14]

The CHART Collaborative et al. Reporting Guideline for Chatbot Health Advice Studies: The CHART Statement. JAMA Network Open 8, e2530220 (2025)

[15]

Heffner, J. et al. Increasing happiness through conversations with artificial intelligence. Preprint. 2025. https://doi.org/10.48550/arXiv.2504.02091

[16]

Schöne, J. et al. Structured AI Dialogues Can Increase Happiness and Meaning in Life. Preprint. Oct. 2025. https://doi.org/10.31234/osf.io/2bf7t_v1

[17]

Tryon, G. S., Birch, S. E. & Verkuilen, J. Meta-analyses of the relation of goal consensus and collaboration to psychotherapy outcome. Psychotherapy 55, 372–383 (2018)

[18]

Hofmann, S. G. et al. The Efficacy of Cognitive Behavioral Therapy: A Review of Meta-analyses. Cognitive Therapy and Research 36, 427–440 (2012)

[19]

Bailey, R. R. Goal Setting and Action Planning for Health Behavior Change. American Journal of Lifestyle Medicine 13, 615–618 (2017)

[20]

Health Innovation Network South London. Measuring Recovery. Tech. rep. Accessed: 2025 (Health Innovation Network South London, 2014). https://www.healthinnovationoxford.org/wp-content/uploads/2015/11/measuring-recovery-2014.pdf

[21]

Yaniv, I. The Benefit of Additional Opinions. Current Directions in Psychological Science 13, 75–78 (2004)

[22]

Harvey, N. & Fischer, I. Taking Advice: Accepting Help, Improving Judgment, and Sharing Responsibility. Organizational Behavior and Human Decision Processes 70, 117–133. ISSN: 0749-5978. https://www.sciencedirect.com/science/article/pii/S0749597897926972 (1997)

[23]

Dietvorst, B. J., Simmons, J. P. & Massey, C. Algorithm aversion: People erroneously avoid algorithms after seeing them err. Journal of Experimental Psychology: General 144, 114–126 (2015)

[24]

Vu, N. C., Li, Y. & High, A. C. Advice Response Theory: A Meta-Analytic Review. Communication Research 0 (2025)

[25]

Schultze, T., Rakotoarisoa, A.-F. & Stefan, S.-H. Effects of distance between initial estimates and advice on advice utilization. Judgment and Decision Making 10, 144–171 (2015)

[26]

Fang, C. M. et al. How AI and Human Behaviors Shape Psychosocial Effects of Extended Chatbot Use: A Longitudinal Randomized Controlled Study. 2025. arXiv:2503.17473 [cs.HC]. https://arxiv.org/abs/2503.17473

[27]

Phang, J. et al. Investigating Affective Use and Emotional Well-being on ChatGPT. 2025. arXiv:2504.03888 [cs.HC]. https://arxiv.org/abs/2504.03888

[28]

Kroenke, K., Spitzer, R. L. & Williams, J. B. The Patient Health Questionnaire-2: Validity of a Two-Item Depression Screener. Medical Care 41, 1284–1292 (2003)

[29]

Kroenke, K. et al. Anxiety disorders in primary care: prevalence, impairment, comorbidity, and detection. Annals of Internal Medicine 146, 317–325 (2007)

[30]

Gierk, B. et al. The somatic symptom scale-8 (SSS-8): a brief measure of somatic symptom burden. JAMA Internal Medicine 174. PMID: 24276929, 399–407 (Mar. 2014)

[31]

Jenkins, C. D. et al. A scale for the estimation of sleep problems in clinical research. Journal of Clinical Epidemiology 41, 313–321 (1988)

[32]

World Health Organization. The World Health Organization-Five Well-Being Index (WHO-5). License: CC-BY-NC-SA 3.0 IGO. World Health Organization (Geneva, 2024)

[33]

Tinkler, L. & Hicks, S. Measuring subjective well-being. Tech. rep. (Office for National Statistics, 2011)

[34]

Keyes, C. L. M. Social Well-Being Scale. APA PsycTests. 1998. https://doi.org/10.1037/t13598-000

[35]

Van Katwyk, P. T. et al. Job-Related Affective Well-Being Scale (JAWS). APA PsycTests. 2000. https://doi.org/10.1037/t01753-000

[36]

Watson, D., Clark, L. A. & Tellegen, A. Development and validation of brief measures of positive and negative affect: The PANAS scales. Journal of Personality and Social Psychology 54, 1063–1070 (1988)

[37]

Killgore, W. D. S. The Affect Grid: A moderately valid, nonspecific measure of pleasure and arousal. Psychological Reports 83, 639–642 (1998)

[38]

Rodger, A. et al. Negative Anecdotes Reduce Policy Support: Evidence from Three Experimental Studies on Communicating Policy (In)Effectiveness. Preprint. 2025. https://osf.io/e2kxc_v1/

[39]

Luettgau, L. et al. HiBayES: A Hierarchical Bayesian Modeling Framework for AI Evaluation Statistics. 2025. arXiv:2505.05602 [cs.AI]. https://arxiv.org/abs/2505.05602

[40]

Dubois, M. et al. Skewed Score: A statistical framework to assess autograders. 2025. arXiv:2507.03772 [cs.LG]. https://arxiv.org/abs/2507.03772

[41]

Phan, D., Pradhan, N. & Jankowiak, M. Composable Effects for Flexible and Accelerated Probabilistic Programming in NumPyro. http://arxiv.org/abs/1912.11554 (Dec. 2019)

[42]

Hoffman, M. D. & Gelman, A. The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. http://arxiv.org/abs/1111.4246 (Nov. 2011)

[43]

Gelman, A. et al. Bayesian Data Analysis. 3rd ed. (2013)

[44]

Watanabe, S. Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory. Journal of Machine Learning Research 11, 3571–3594 (2010)

Supplementary Information

Supplementary Figure S1: Sociodemographic variable distributions in the full sample (N = 2,302).

Supplementary Figure S2: Self-reported u...