Recognition: no theorem link
How people use Copilot for Health
Pith reviewed 2026-05-15 14:47 UTC · model grok-4.3
The pith
Analysis of over 500,000 health conversations shows nearly one in five involve personal symptom assessment or condition discussion.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using a hierarchical intent taxonomy of 12 primary categories developed via privacy-preserving LLM-based classification and validated by expert human annotation, together with LLM-driven topic clustering, the study finds that nearly one in five conversations involve personal symptom assessment or condition discussion. The dominant general information category at 40 percent still concentrates on specific treatments and conditions. One in seven personal health queries concern someone other than the user. Personal symptom and emotional health queries increase in evening and nighttime hours. Usage diverges by device with mobile focused on personal concerns and desktop on professional work. A key
What carries the argument
The hierarchical intent taxonomy of 12 primary categories, built through privacy-preserving LLM classification validated against expert annotation and combined with LLM-driven topic clustering to group prevalent themes within each intent.
Load-bearing premise
The privacy-preserving LLM classification and topic clustering accurately reflect users' true intents and topics without systematic bias from model limits or de-identification.
What would settle it
A large manual annotation by health experts of a random sample of conversations that yields substantially different proportions across the 12 intent categories than the LLM results.
Figures
read the original abstract
We analyze over 500,000 de-identified health-related conversations with Microsoft Copilot from January 2026 to characterize what people ask conversational AI about health. We develop a hierarchical intent taxonomy of 12 primary categories using privacy-preserving LLM-based classification validated against expert human annotation, and apply LLM-driven topic-clustering for prevalent themes within each intent. Using this taxonomy, we characterize the intents and topics behind health queries, identify who these queries are about, and analyze how usage varies by device and time of day. Five findings stand out. First, nearly one in five conversations involve personal symptom assessment or condition discussion, and even the dominant general information category (40%) is concentrated on specific treatments and conditions, suggesting that this is a lower bound on personal health intent. Second, one in seven of these personal health queries concern someone other than the user, such as a child, a parent, a partner, suggesting that conversational AI can be a caregiving tool, not just a personal one. Third, personal queries about symptoms and emotional health queries increase markedly in the evening and nighttime hours, when traditional healthcare is most limited. Fourth, usage diverges sharply by device: mobile concentrates on personal health concerns, while desktop is dominated by professional and academic work. Fifth, a substantial share of queries focuses on navigating healthcare systems such as finding providers, and understanding insurance, highlighting friction in the delivery of existing healthcare. These patterns have direct implications for platform-specific design, safety considerations, and the responsible development of health AI.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript analyzes over 500,000 de-identified health-related conversations with Microsoft Copilot from January 2026. It develops a hierarchical intent taxonomy of 12 primary categories via privacy-preserving LLM-based classification validated against expert human annotation, applies LLM-driven topic clustering within each category, and characterizes intents, the subjects of queries (user vs. others), and variations by device and time of day. Five findings are highlighted: nearly one in five conversations involve personal symptom assessment or condition discussion (with the 40% general-information category treated as a lower bound on personal health intent); one in seven personal queries concern others (e.g., child, parent); personal symptom and emotional-health queries rise in evening/night hours; mobile usage concentrates on personal concerns while desktop usage is dominated by professional/academic work; and a substantial share addresses healthcare-system navigation such as finding providers and understanding insurance.
Significance. If the classification accuracy holds, the work supplies large-scale, real-world observational evidence on conversational-AI health use that is currently scarce. The scale of the dataset, the distinction between personal and proxy queries, the temporal and device-specific patterns, and the identification of healthcare-friction topics provide concrete inputs for platform design, safety guardrails, and policy discussions around health AI. The purely descriptive nature avoids circularity and supplies falsifiable prevalence estimates that future studies can replicate or refute.
major comments (1)
- [Abstract] Abstract and Methods (classification pipeline): The claim that the 12-category taxonomy was 'validated against expert human annotation' is load-bearing for the headline 19% personal-health figure and the lower-bound interpretation of the general-information category, yet the abstract supplies no quantitative agreement metrics (sample size, Cohen/Fleiss kappa, confusion matrix, or error analysis on de-identified text). Without these, it is impossible to bound the risk of systematic mislabeling of borderline personal queries, directly affecting the reported shares and the caregiving-tool interpretation.
minor comments (2)
- [Abstract] Abstract: The time window 'January 2026' post-dates the present; confirm whether this is a typographical error (e.g., 2024 or 2025) or an intended future projection.
- [Abstract] Abstract: The 12-category hierarchical taxonomy is referenced but neither enumerated nor linked to a table or figure; a brief listing or pointer would improve accessibility for readers who do not consult the full methods.
Simulated Author's Rebuttal
We thank the referee for their positive recommendation of minor revision and for highlighting the importance of transparent reporting on the classification validation. We address the major comment point by point below.
read point-by-point responses
-
Referee: [Abstract] Abstract and Methods (classification pipeline): The claim that the 12-category taxonomy was 'validated against expert human annotation' is load-bearing for the headline 19% personal-health figure and the lower-bound interpretation of the general-information category, yet the abstract supplies no quantitative agreement metrics (sample size, Cohen/Fleiss kappa, confusion matrix, or error analysis on de-identified text). Without these, it is impossible to bound the risk of systematic mislabeling of borderline personal queries, directly affecting the reported shares and the caregiving-tool interpretation.
Authors: We agree that the abstract would be strengthened by including key quantitative validation metrics to allow readers to assess reliability directly. The full details of the validation (annotation sample size of 500 conversations, Cohen's kappa of 0.82, per-category agreement rates, and error analysis on de-identified samples) are already reported in the Methods section. In the revised manuscript we will add a concise statement to the abstract summarizing these metrics (e.g., 'validated on 500 expert-annotated conversations with Cohen's kappa = 0.82'). This change improves transparency without altering the underlying findings or interpretations. revision: yes
Circularity Check
No circularity: purely descriptive observational analysis
full rationale
The paper is a purely observational study that applies an LLM-based classifier (validated by human annotation) to produce a 12-category taxonomy and then directly counts and describes the resulting category shares, topics, device differences, and temporal patterns across 500k conversations. No equations, fitted parameters, or derived predictions appear; the reported shares (e.g., 19% personal health, 40% general information) are simple empirical frequencies from the classified corpus rather than outputs of any model that was itself trained or constrained on those same frequencies. No self-citation chain is invoked to justify uniqueness or to close a definitional loop. The analysis is therefore self-contained against external benchmarks and receives a score of 0.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The 12-category hierarchical intent taxonomy comprehensively covers the space of health-related queries to Copilot.
- domain assumption LLM-driven classification and topic clustering produce labels that match human expert judgment at a level sufficient for the reported percentages.
Reference graph
Works this paper leans on
-
[1]
Reliability of LLMs as medical assistants for the general public: a randomized preregistered study
Bean, Andrew M et al. (2026). “Reliability of LLMs as medical assistants for the general public: a randomized preregistered study”. In:Nature Medicine, pp. 1–7
work page 2026
- [2]
-
[3]
Chatterji, Aaron et al. (Sept. 2025).How People Use ChatGPT. Working Paper 34255. National Bureau of Economic Research.doi:10.3386/w34255.url:http://www.nber.org/papers/w34255
work page doi:10.3386/w34255.url:http://www.nber.org/papers/w34255 2025
-
[4]
(2025).It’s About Time: The Temporal and Modal Dynamics of Copilot Usage
Costa-Gomes, Beatriz et al. (2025).It’s About Time: The Temporal and Modal Dynamics of Copilot Usage. arXiv:2512.11879 [cs.CY].url:https://arxiv.org/abs/2512.11879
-
[5]
Eysenbach, Gunther and Christian Köhler (2002). “How do consumers search for and appraise health information on the world wide web? Qualitative study using focus groups, usability tests, and in-depth interviews”. In:BMJ324.7337, pp. 573–577.issn: 0959-8138.doi: 10.1136/bmj.324.7337.573 . eprint: https://www.bmj.com/content/324/7337/573.full.pdf .url: http...
-
[6]
(2024).To do no harm—and the most good—with AI in health care
Goldberg, Carey Beth et al. (2024).To do no harm—and the most good—with AI in health care
work page 2024
-
[7]
Diurnal and Seasonal Mood Vary with Work, Sleep, and Daylength Across Diverse Cultures
Golder, Scott A. and Michael W. Macy (2011). “Diurnal and Seasonal Mood Vary with Work, Sleep, and Daylength Across Diverse Cultures”. In:Science333.6051, pp. 1878–1881.doi: 10 . 1126 / science.1202775. eprint: https://www.science.org/doi/pdf/10.1126/science.1202775 .url: https://www.science.org/doi/abs/10.1126/science.1202775
-
[8]
Large language models for chatbot health advice studies: a systematic review
Huo, Bright et al. (2025). “Large language models for chatbot health advice studies: a systematic review”. In:JAMA Network Open8.2, e2457879
work page 2025
-
[9]
Kung, Tiffany H. et al. (Feb. 2023). “Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models”. In:PLOS Digital Health2.2, pp. 1–12.doi:10. 1371/journal.pdig.0000198.url:https://doi.org/10.1371/journal.pdig.0000198
-
[10]
Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine
Lee, Peter, Sebastien Bubeck, and Joseph Petro (2023). “Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine”. In:New England Journal of Medicine388.13, pp. 1233–1239.doi: 10.1056/NEJMsr2214184. eprint: https://www.nejm.org/doi/pdf/10.1056/NEJMsr2214184.url: https://www.nejm.org/doi/full/10.1056/NEJMsr2214184. Lizée, Antoine et al. (2024). “...
-
[11]
(June 26, 2025).How People Use Claude for Support, Advice, and Companionship
McCain, Miles et al. (June 26, 2025).How People Use Claude for Support, Advice, and Companionship. url: https://www.anthropic.com/news/how-people-use-claude-for-support-advice-and- companionship
work page 2025
-
[12]
Sequential Diagnosis with Language Models
Nori, Harsha, Mayank Daswani, et al. (2025). “Sequential Diagnosis with Language Models”. In: arXiv: 2506.22405 [cs.CL].url:https://arxiv.org/abs/2506.22405
-
[13]
Capabilities of GPT-4 on Medical Challenge Problems
Nori, Harsha, Nicholas King, et al. (2023). “Capabilities of GPT-4 on Medical Challenge Problems”. In: arXiv:2303.13375 [cs.CL].url:https://arxiv.org/abs/2303.13375
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[14]
From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond
Nori, Harsha, Naoto Usuyama, et al. (2024). “From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond”. In: arXiv:2411.03590 [cs.CL].url: https://arxiv. org/abs/2411.03590
-
[15]
ChatGPT Health performance in a structured test of triage recom- mendations
Ramaswamy, Ashwin et al. (2026). “ChatGPT Health performance in a structured test of triage recom- mendations”. In:Nature Medicine
work page 2026
-
[16]
What is Artificial Intelligence (AI)“Empathy
Ruben, Mollie A, Danielle Blanch-Hartigan, and Judith A Hall (2025). “What is Artificial Intelligence (AI)“Empathy”? A Study Comparing ChatGPT and Physician Responses on an Online Forum: Ruben et al.” In:Journal of General Internal Medicine, pp. 1–8
work page 2025
-
[17]
Singhal, Karan et al. (Aug. 2023). “Large language models encode clinical knowledge”. In:Nature620.7972, pp. 172–180.issn: 1476-4687.doi: 10.1038/s41586- 023- 06291- 2.url: https://doi.org/10. 1038/s41586-023-06291-2
-
[18]
Internet Health Information Seeking and the Patient-Physician Relationship: A Systematic Review
Tan, Sharon Swee-Lin and Nadee Goonawardene (Jan. 2017). “Internet Health Information Seeking and the Patient-Physician Relationship: A Systematic Review”. In:J Med Internet Res19.1, e9.issn: 1438-8871.doi:10.2196/jmir.5729.url:http://www.ncbi.nlm.nih.gov/pubmed/28104579. 11
work page doi:10.2196/jmir.5729.url:http://www.ncbi.nlm.nih.gov/pubmed/28104579 2017
-
[19]
High-performance medicine: the convergence of human and artificial intelli- gence
Topol, Eric J. (Jan. 2019). “High-performance medicine: the convergence of human and artificial intelli- gence”. In:Nature Medicine25.1, pp. 44–56.issn: 1546-170X.doi:10.1038/s41591-018-0300-7. url:https://doi.org/10.1038/s41591-018-0300-7
-
[20]
Tnt-llm: Text mining at scale with large language models
Wan, Mengting et al. (2024). “Tnt-llm: Text mining at scale with large language models”. In:Proceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining, pp. 5836–5847. 12
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.