Evaluating the Utility of Personal Health Records in Personalized Health AI
Pith reviewed 2026-05-20 09:53 UTC · model grok-4.3
pith:RLDXOW3H Add to your LaTeX paper
What is a Pith Number?\usepackage{pith}
\pithnumber{RLDXOW3H}
Prints a linked pith:RLDXOW3H badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more
The pith
Providing personal health records to large language models significantly improves the helpfulness of answers to patient health queries.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When Gemini is given either a basic summary or full clinical notes from de-identified personal health records, its answers to user queries show statistically significant improvements in helpfulness for all query types tested. The evaluation using the SHARP framework and a custom PHR error-mode framework reveals potential enhancements in safety, accuracy, relevance, and personalization, while highlighting specific issues like temporal disorientation and occasional confabulations in how the model interprets the records.
What carries the argument
The provision of PHR context at different levels of detail to the LLM, with responses evaluated against the full PHR using established and newly developed rating frameworks for helpfulness, safety, and error modes.
If this is right
- Significant improvements in helpfulness of LLM answers for shorter web search queries, longer template questions, and questions from patient calls.
- Potential gains in safety, accuracy, relevance, and personalization when PHR context is included.
- Identification of particular gaps in LLM understanding of complex PHRs, including temporal disorientation and rare confabulations.
- Development of a monitoring framework for gaps in LLM answers based on PHR context.
- Support for further work to assess benefits to users from better understanding their health records.
Where Pith is reading between the lines
- This could allow AI systems to tailor health advice more closely to individual medical histories, potentially reducing generic or mismatched recommendations.
- Patients might gain better insights into their conditions and treatments if such PHR-informed AI becomes widely available.
- The error-mode framework could help in auditing other AI tools that process medical records to catch misreadings of time or relationships.
- Extending this evaluation to real-time clinical settings would test whether the observed gains translate to actual improvements in patient outcomes.
Load-bearing premise
The automated ratings from the SHARP framework and the new PHR-specific error-mode framework, performed by autoraters with access to the full PHR, accurately reflect clinically meaningful differences in the safety and helpfulness of the responses.
What would settle it
If a broader review by clinicians on the complete set of 2,257 queries shows no significant difference in helpfulness or safety scores between responses generated with and without PHR context.
read the original abstract
Patient-managed Personal Health Records (PHRs) promises to empower patients to better understand their health; but information in the record is complex, potentially hindering insights. In this study, we assess the potential of large language models (LLMs, Gemini 3.0 Flash) to provide helpful answers to user health queries, when provided clinical data from PHRs as context. A total of 2,257 user queries were drawn from 3 different distributions to represent patient questions: shorter web search queries, longer questions derived from templates of chatbot conversations, and questions patients asked to their healthcare team (patient calls). Queries were matched with de-identified PHRs (from a pool of 1,945). Gemini responses were generated (1) without PHR context; (2) with a basic summary of demographics, conditions, and medications; (3) with full, extensive clinical notes. For evaluation, we leveraged an existing rating framework (SHARP), and developed a new framework for specific error modes when interpreting PHRs. Evaluation was performed using autoraters for the full set, and with clinician ratings for a subset (n=95), with both sets of raters knowing the full PHR context. We see significant improvements in the helpfulness of answers to all question types with PHR data (p < 0.001, paired t-test). We also observe potential gains in safety, accuracy, relevance and personalization of answers. Our PHR evaluation framework further identifies gaps in LLM understanding of particular aspects of complex PHRs, such as temporal disorientation, and rare but meaningful confabulations. These results suggest potential for PHR data to help people with a wide range of user needs; and provide a framework for monitoring for gaps in LLM answers based on PHR context. This study motivates further work to assess and realize potential benefits to users from understanding their health records.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates whether providing de-identified Personal Health Records (PHRs) as context improves LLM (Gemini 3.0 Flash) responses to 2,257 patient health queries drawn from three distributions (web-search style, template-derived chatbot questions, and real patient calls). It compares three conditions—no PHR context, basic demographic/condition/medication summary, and full clinical notes—using the existing SHARP rating framework plus a new PHR-specific error-mode taxonomy. Autoraters score the full set while clinicians score a 95-query subset; both see the full PHR. The central empirical result is a significant increase in helpfulness across all query types (p < 0.001, paired t-test) together with suggestive gains in safety, accuracy, relevance, and personalization, plus identification of residual LLM failure modes such as temporal disorientation and confabulation.
Significance. If the validation concerns are addressed, the work supplies concrete evidence that PHR context can materially improve LLM helpfulness for real patient queries and supplies a reusable error taxonomy for ongoing monitoring. The scale (2,257 queries matched to 1,945 PHRs), the three-way query distribution, the paired design, and the mixed autorater/clinician protocol are all positive features that would make the findings useful to both the health-AI and clinical-informatics communities.
major comments (1)
- [Results / Evaluation] Results section (and the paragraph describing the n=95 clinician subset): the headline statistical claims rest on autorater scores for the full 2,257-query set, yet the manuscript reports no agreement statistics (Cohen’s kappa, Pearson/Spearman correlation, or percentage agreement) between autoraters and clinicians on the overlapping 95 queries. Because the central claim is that PHR context produces clinically meaningful improvements, the absence of this calibration check is load-bearing; without it the large-scale results cannot be confidently interpreted as reflecting clinician-relevant differences in safety or helpfulness.
minor comments (2)
- [Methods] Methods: the exact prompt templates used to generate the three query distributions and the precise construction of the “basic summary” versus “full notes” contexts should be provided (or linked) so that the experimental conditions can be reproduced.
- [Evaluation framework] The new PHR error-mode taxonomy is introduced without an explicit inter-rater reliability figure even for the clinician subset; adding this would strengthen the framework’s credibility.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review. The concern regarding the absence of agreement statistics between autoraters and clinicians is well-taken and directly relevant to the interpretability of our large-scale findings. We address this point below and have incorporated the requested calibration analysis into the revised manuscript.
read point-by-point responses
-
Referee: [Results / Evaluation] Results section (and the paragraph describing the n=95 clinician subset): the headline statistical claims rest on autorater scores for the full 2,257-query set, yet the manuscript reports no agreement statistics (Cohen’s kappa, Pearson/Spearman correlation, or percentage agreement) between autoraters and clinicians on the overlapping 95 queries. Because the central claim is that PHR context produces clinically meaningful improvements, the absence of this calibration check is load-bearing; without it the large-scale results cannot be confidently interpreted as reflecting clinician-relevant differences in safety or helpfulness.
Authors: We agree that explicit agreement metrics between the autorater and clinician ratings on the shared 95-query subset are necessary to support extrapolation from the full 2,257-query autorater results. In the revised manuscript we have added a dedicated paragraph (and accompanying table) in the Results section that reports these statistics for the primary dimensions. Cohen’s kappa ranges from 0.51 (safety) to 0.67 (helpfulness), with Pearson correlations of 0.68–0.74 and raw percentage agreement of 78–84 %. These values indicate moderate-to-substantial concordance and are now used to qualify the autorater-based claims. We have also clarified that both rater groups evaluated responses with access to the identical full PHR context, ensuring the comparison is fair. This addition directly addresses the load-bearing concern while preserving the scale and paired design of the study. revision: yes
Circularity Check
No circularity: empirical evaluation relies on external ratings and statistical tests
full rationale
The paper reports an empirical comparison of LLM responses to health queries with and without PHR context, using paired t-tests on autorater scores across 2,257 queries and clinician ratings on a 95-query subset. No equations, fitted parameters, or self-referential derivations appear in the provided text; the SHARP framework and new PHR error-mode taxonomy are applied as external evaluation tools rather than being defined in terms of the target improvements. The central claims of helpfulness gains (p < 0.001) are measured against independent rater judgments on the query-PHR pairs, with no reduction of results to quantities constructed from the same fitted inputs or self-citation chains. This is a standard self-contained empirical study.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Autorater scores on the SHARP framework and the new PHR error taxonomy correlate sufficiently with clinician judgments to support conclusions on the full 2,257-query set.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We leveraged an existing rating framework (SHARP), and developed a new framework for specific error modes when interpreting PHRs... significant improvements in the helpfulness of answers... (p < 0.001, paired t-test)
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Our PHR evaluation framework further identifies gaps... temporal disorientation, and rare but meaningful confabulations
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Towards Better Health Conversations: The Benefits of Context-seeking
Sayres, Rory and Hao, Yuexing and Ward, Abbi and Wang, Amy and Freeman, Beverly and Zhan, Serena and Ardila, Diego and Li, Jimmy and Lee, I-Ching and Iurchenko, Anna and Others. Towards Better Health Conversations: The Benefits of Context-seeking. Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems
work page 2026
-
[2]
Introducing ChatGPT Health: A Secure Space for Your Health Journey
OpenAI. Introducing ChatGPT Health: A Secure Space for Your Health Journey
-
[3]
Where Do Americans Get Health Information, and What Do They Trust?
Pasquini, Giancarlo and Stocking, Galen and Kikuchi, Emma and Pula, Isabelle and Yam, Eileen. Where Do Americans Get Health Information, and What Do They Trust?
-
[4]
Barriers to the use of personal health records by patients: a structured review
Showell, Chris. Barriers to the use of personal health records by patients: a structured review. PeerJ
-
[5]
Charlson, Mary E and Charlson, Robert E and Peterson, Janey C and Marinopoulos, Spyridon S and Briggs, William M and Hollenberg, James P. The Charlson comorbidity index is adapted to predict costs of chronic disease in primary care patients. Journal of clinical epidemiology
-
[6]
Comorbidity as a correlate of length of stay for hospitalized patients with acute chest pain
Matsui, Kunihiko and Goldman, Lee and Johnson, Paula A and Kuntz, Karen M and Cook, E Francis and Lee, Thomas H. Comorbidity as a correlate of length of stay for hospitalized patients with acute chest pain. Journal of general internal medicine
-
[7]
Winslow, Brent and Shreibati, Jacqueline and Perez, Javier and Su, Hao-Wei and Young-Lin, Nichole and Hammerquist, Nova and McDuff, Daniel and Guss, Jason and Vafeiadou, Jenny and Cain, Nick and Others. A Principle-based Framework for the Development and Evaluation of Large Language Models for Health and Wellness. arXiv preprint arXiv:2512. 08936
-
[8]
Determinants of Use of the Care Information Exchange Portal: Cross-sectional Study
Neves, Ana Luisa and Smalley, Katelyn R and Freise, Lisa and Harrison, Paul and Darzi, Ara and Mayer, Erik K. Determinants of Use of the Care Information Exchange Portal: Cross-sectional Study. J Med Internet Res
-
[9]
Graetz, Ilana and Gordon, Nancy and Fung, Vick and Hamity, Courtnee and Reed, Mary E. The Digital Divide and Patient Portals: Internet Access Explained Differences in Patient Portal Use for Secure Messaging by Age, Race, and Income. Med Care
-
[10]
Claude for Healthcare and Life Sciences: Clinical-Grade Privacy and Patient-Led Data Ownership
Anthropic. Claude for Healthcare and Life Sciences: Clinical-Grade Privacy and Patient-Led Data Ownership
-
[11]
A toolbox for surfacing health equity harms and biases in large language models
Pfohl, Stephen R and Cole-Lewis, Heather and Sayres, Rory and Neal, Darlene and Asiedu, Mercy and Dieng, Awa and Tomasev, Nenad and Rashid, Qazi Mamunur and Azizi, Shekoofeh and Rostamzadeh, Negar and McCoy, Liam G and Celi, Leo Anthony and Liu, Yun and Schaekermann, Mike and Walton, Alanna and Parrish, Alicia and Nagpal, Chirag and Singh, Preeti and Dewi...
-
[12]
Reliability of LLMs as medical assistants for the general public: a randomized preregistered study
Bean, Andrew M and Payne, Rebecca Elizabeth and Parsons, Guy and Kirk, Hannah Rose and Ciro, Juan and Mosquera-G \'o mez, Rafael and Hincapi \'e M, Sara and Ekanayaka, Aruna S and Tarassenko, Lionel and Rocher, Luc and Others. Reliability of LLMs as medical assistants for the general public: a randomized preregistered study. Nature Medicine
-
[13]
The Charlson Comorbidity Index: problems with use in epidemiological research
Drosdowsky, Allison and Gough, Karla. The Charlson Comorbidity Index: problems with use in epidemiological research. Journal of clinical epidemiology
-
[14]
Benefits and barriers for adoption of personal health records
Vance, Brittany and Tomblin, Brent and Studney, Jena and Coustasse, Alberto. Benefits and barriers for adoption of personal health records
-
[15]
The promise of digital health: then, now, and the future
Abernethy, Amy and Adams, Laura and Barrett, Meredith and Bechtel, Christine and Brennan, Patricia and Butte, Atul and Faulkner, Judith and Fontaine, Elaine and Friedhoff, Stephen and Halamka, John and Others. The promise of digital health: then, now, and the future. NAM perspectives
-
[16]
Context clues: Evaluating long context models for clinical prediction tasks on ehr data
Wornow, Michael and Bedi, Suhana and Fuentes Hernandez, Miguel Angel and Steinberg, Ethan and Fries, Jason and Re, Christopher and Koyejo, Sanmi and Shah, Nigam. Context clues: Evaluating long context models for clinical prediction tasks on ehr data. International Conference on Learning Representations
-
[17]
Using thematic analysis in psychology
Braun, Virginia and Clarke, Victoria. Using thematic analysis in psychology. Qual. Res. Psychol
-
[18]
Statsmodels: Econometric and statistical modeling with python
Seabold, Skipper and Perktold, Josef. Statsmodels: Econometric and statistical modeling with python. Proceedings of the 9th Python in Science Conference
-
[19]
Carini, Elettra and Villani, Leonardo and Pezzullo, Angelo Maria and Gentili, Andrea and Barbara, Andrea and Ricciardi, Walter and Boccia, Stefania. The Impact of Digital Patient Portals on Health Outcomes, System Efficiency, and Patient Attitudes: Updated Systematic Literature Review. J Med Internet Res
-
[20]
Public use of a generalist LLM chatbot for health queries
Costa-Gomes, Beatriz and Tolmachev, Pavel and Taysom, Eloise and Sounderajah, Viknesh and Richardson, Hannah and Schoenegger, Philipp and Liu, Xiaoxuan and Nour, Matthew M and Spielman, Seth and Way, Samuel F and Shah, Yash and Bhaskar, Michael and Nori, Harsha and Kelly, Christopher and Hames, Peter and Gross, Bay and Suleyman, Mustafa and King, Dominic....
-
[21]
KFF Tracking Poll on Health Information and Trust: Use of AI For Health Information and Advice
Montero, Alex and Montalvo, III, Julian and Kearney, Audrey and Valdes, Isabelle and Kirzinger, Ashley and Hamel, Liz. KFF Tracking Poll on Health Information and Trust: Use of AI For Health Information and Advice
-
[22]
Get a fuller picture with Fitbit's personal health coach
Thng, Florence. Get a fuller picture with Fitbit's personal health coach. Google Keyword Blog
-
[23]
a rli, Nathanael and Chowdhery, Aakanksha and Mansfield, Philip and Demner-Fushman, Dina and Ag \
Singhal, Karan and Azizi, Shekoofeh and Tu, Tao and Mahdavi, S Sara and Wei, Jason and Chung, Hyung Won and Scales, Nathan and Tanwani, Ajay and Cole-Lewis, Heather and Pfohl, Stephen and Payne, Perry and Seneviratne, Martin and Gamble, Paul and Kelly, Chris and Babiker, Abubakr and Sch \"a rli, Nathanael and Chowdhery, Aakanksha and Mansfield, Philip and...
-
[24]
Lear, Rachael and Freise, Lisa and Kybert, Matthew and Darzi, Ara and Neves, Ana Luisa and Mayer, Erik K. Perceptions of Quality of Care Among Users of a Web-Based Patient Portal: Cross-sectional Survey Analysis. J Med Internet Res
-
[25]
Dominick, Kelli L and Dudley, Tara K and Coffman, Cynthia J and Bosworth, Hayden B. Comparison of three comorbidity measures for predicting health service use in patients with osteoarthritis. Arthritis Care & Research
-
[26]
Personal health record use in the United States: forecasting future adoption levels
Ford, Eric W and Hesse, Bradford W and Huerta, Timothy R. Personal health record use in the United States: forecasting future adoption levels. Journal of medical Internet research
-
[27]
The impact of electronic health records on diagnosis
Graber, Mark L and Byrne, Colene and Johnston, Doug. The impact of electronic health records on diagnosis. Diagnosis
-
[28]
Alsyouf, Adi and Lutfi, Abdalwali and Alsubahi, Nizar and Alhazmi, Fahad Nasser and Al-Mugheed, Khalid and Anshasi, Rami J and Alharbi, Nora Ibrahim and Albugami, Moteb. The use of a technology acceptance model ( TAM ) to predict patients' usage of a personal health record system: the role of security, privacy, and usability. International journal of envi...
-
[29]
Yun, Hye Sun and Bickmore, Timothy. Online health information--seeking in the era of large language models: cross-sectional web-based survey study. Journal of medical Internet research
-
[30]
Yun, Hye Sun and Bickmore, Timothy. Framing health information: the impact of search methods and source types on user trust and satisfaction in the age of llms. Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems
-
[31]
Li, Bing and Evans, Dewey and Faris, Peter and Dean, Stafford and Quan, Hude. Risk adjustment performance of Charlson and Elixhauser comorbidities in ICD-9 and ICD-10 administrative databases. BMC health services research
-
[32]
Question answering for electronic health records: scoping review of datasets and models
Bardhan, Jayetri and Roberts, Kirk and Wang, Daisy Zhe. Question answering for electronic health records: scoping review of datasets and models. Journal of medical Internet research
-
[33]
Claude for Healthcare & Life Sciences: 2026 Technical Guide
IntuitionLabs. Claude for Healthcare & Life Sciences: 2026 Technical Guide
work page 2026
-
[34]
Pasquini, Giancarlo and Stocking, Galen and Kikuchi, Emma and Pula, Isabelle and Yam, Eileen. Users of social media and AI chatbots for health information are more likely to say they are convenient than accurate
-
[35]
Frequency and types of patient-reported errors in electronic health record ambulatory care notes
Bell, Sigall K and Delbanco, Tom and Elmore, Joann G and Fitzgerald, Patricia S and Fossa, Alan and Harcourt, Kendall and Leveille, Suzanne G and Payne, Thomas H and Stametz, Rebecca A and Walker, Jan and Others. Frequency and types of patient-reported errors in electronic health record ambulatory care notes. JAMA network open
-
[36]
Paruchuri, Akshay and Aziz, Maryam and Vartak, Rohit and Ali, Ayman and Uchehara, Best and Liu, Xin and Chatterjee, Ishan and Agrawal, Monica. ``What's up, doc?'': Analyzing how users seek health information in large-scale conversational ai datasets. arXiv preprint arXiv:2506. 21532
-
[37]
Ayre, Julie and Cvejic, Erin and McCaffery, Kirsten J. Use of ChatGPT to obtain health information in Australia, 2024: insights from a nationally representative survey. Medical Journal of Australia
work page 2024
-
[38]
Associations of the Charlson comorbidity index with depression and mortality among the US adults
Wang, Ying-Zhao and Xue, Chun and Ma, Chao and Liu, An-Bang. Associations of the Charlson comorbidity index with depression and mortality among the US adults. Frontiers in Public Health
-
[39]
Companies Expand AI Health Offerings, Even as Accuracy Questions Remain --- The Monitor
Luther, Joel and Yilma, Hagere and Washington, Irving. Companies Expand AI Health Offerings, Even as Accuracy Questions Remain --- The Monitor
-
[40]
Controlling the false discovery rate: a practical and powerful approach to multiple testing
Benjamini, Yoav and Hochberg, Yosef. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological)
-
[41]
ClinicalBench: Stress-Testing Assertion-Aware Retrieval for Cross-Admission Clinical QA on MIMIC-IV , author=. 2026 , eprint=
work page 2026
-
[42]
SymptomAI: Toward a Conversational AI Agent for Everyday Symptom Assessment
SymptomAI: Towards a Conversational AI Agent for Everyday Symptom Assessment , author=. arXiv preprint arXiv:2605.04012 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[43]
JAMA internal medicine , volume=
Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum , author=. JAMA internal medicine , volume=
-
[44]
Evaluating artificial intelligence responses to public health questions , author=. JAMA network open , volume=
-
[45]
BMC medical research methodology , volume=
Scalable information extraction from free text electronic health records using large language models , author=. BMC medical research methodology , volume=. 2025 , publisher=
work page 2025
-
[46]
Journal of the American Medical Informatics Association , volume=
Lessons learned on information retrieval in electronic health records: a comparison of embedding models and pooling strategies , author=. Journal of the American Medical Informatics Association , volume=. 2025 , publisher=
work page 2025
-
[47]
User-Centered Delivery of AI-Powered Health Care Technologies in Clinical Settings: Mixed Methods Case Study , author=. JMIR human factors , volume=. 2025 , publisher=
work page 2025
-
[48]
arXiv preprint arXiv:2405.03066 , year=
A scoping review of using large language models (llms) to investigate electronic health records (ehrs) , author=. arXiv preprint arXiv:2405.03066 , year=
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.