Toward Zero-Egress Psychiatric AI: On-Device LLM Deployment for Privacy-Preserving Mental Health Decision Support
Pith reviewed 2026-05-10 05:15 UTC · model grok-4.3
The pith
A mobile app runs fine-tuned LLMs entirely locally to deliver psychiatric assessments without any patient data leaving the device.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The work presents a zero-egress, cross-platform mobile application in which three lightweight, fine-tuned, and quantized open-source LLMs are coordinated by an on-device orchestration layer for ensemble inference and consensus-based diagnostic reasoning. The system produces DSM-5-aligned assessments for differential diagnosis and symptom mapping, with accuracy reported as comparable to the server-side version and real-time inference latency on commodity hardware.
What carries the argument
An on-device orchestration layer that coordinates ensemble inference and consensus-based diagnostic reasoning among three quantized LLMs.
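The abstract does not spell out how the consensus mechanism works. A minimal majority-vote sketch, assuming each ensemble member returns a single diagnosis with a confidence score; the model names match the paper, but the `ModelOutput` structure, field names, and confidence tie-break rule are hypothetical:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class ModelOutput:
    """Diagnosis proposed by one ensemble member, with a confidence score."""
    model: str
    diagnosis: str
    confidence: float

def consensus(outputs):
    """Majority vote over proposed diagnoses, with a confidence tie-break.

    Returns the winning diagnosis and the fraction of models agreeing with it.
    """
    votes = Counter(o.diagnosis for o in outputs)
    top = max(votes.values())
    tied = [d for d, n in votes.items() if n == top]
    if len(tied) > 1:
        # Tie-break: pick the tied diagnosis with the highest summed confidence.
        winner = max(tied, key=lambda d: sum(o.confidence for o in outputs
                                             if o.diagnosis == d))
    else:
        winner = tied[0]
    return winner, votes[winner] / len(outputs)

outputs = [
    ModelOutput("gemma", "MDD", 0.82),
    ModelOutput("phi-3.5-mini", "MDD", 0.74),
    ModelOutput("qwen2", "GAD", 0.69),
]
diagnosis, agreement = consensus(outputs)  # "MDD" with 2/3 agreement
```

A real orchestration layer would also have to reconcile free-text model outputs into a shared label space before any vote, a step this sketch assumes away.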
If this is right
- Clinicians gain local access to differential diagnosis support and evidence-linked symptom mapping without data transmission.
- Patients can use self-screening features with built-in safeguards while data remains on-device.
- The platform becomes usable in operational environments that prohibit any external data flow.
- Real-time performance is maintained on standard mobile hardware rather than requiring specialized servers.
Where Pith is reading between the lines
- The design could extend to other privacy-sensitive medical domains by swapping the diagnostic focus while retaining the local orchestration layer.
- Long-term, repeated local use might allow the models to adapt to individual users through on-device updates without cloud involvement.
- Testing the system on diverse populations would reveal whether quantization introduces biases in specific demographic groups.
Load-bearing premise
The quantized and fine-tuned models retain enough diagnostic fidelity for DSM-5 assessments to match the accuracy of their full server-side versions.
What would settle it
A direct comparison of diagnostic outputs from the on-device system versus the server version on the same set of clinical cases, measuring agreement rates and specific error types.
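Such a comparison reduces to simple paired statistics. A sketch computing the two headline numbers, raw agreement rate and Cohen's kappa, on hypothetical paired outputs; the label set and case data are illustrative, not from the paper:

```python
from collections import Counter

def agreement_and_kappa(device_labels, server_labels):
    """Raw agreement rate and Cohen's kappa for two paired label sequences."""
    n = len(device_labels)
    observed = sum(a == b for a, b in zip(device_labels, server_labels)) / n
    # Chance agreement from each system's marginal label frequencies.
    pa, pb = Counter(device_labels), Counter(server_labels)
    expected = sum(pa[k] * pb[k] for k in pa) / (n * n)
    if expected == 1:
        return observed, 1.0
    return observed, (observed - expected) / (1 - expected)

# Hypothetical diagnoses on the same four clinical cases.
device = ["MDD", "MDD", "GAD", "PTSD"]
server = ["MDD", "GAD", "GAD", "PTSD"]
obs, kappa = agreement_and_kappa(device, server)  # obs = 0.75
```

Kappa matters here because with a skewed condition mix, a high raw agreement rate can be mostly chance agreement.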
Original abstract
Privacy represents one of the most critical yet underaddressed barriers to AI adoption in mental healthcare -- particularly in high-sensitivity operational environments such as military, correctional, and remote healthcare settings, where the risk of patient data exposure can deter help-seeking behavior entirely. Existing AI-enabled psychiatric decision support systems predominantly rely on cloud-based inference pipelines, requiring sensitive patient data to leave the device and traverse external servers, creating unacceptable privacy and security risks in these contexts. In this paper, we propose a zero-egress, on-device AI platform for privacy-preserving psychiatric decision support, deployed as a cross-platform mobile application. The proposed system extends our prior work on fine-tuned LLM consortiums for psychiatric diagnosis standardization by fundamentally re-architecting the inference pipeline for fully local execution -- ensuring that no patient data is transmitted to, processed by, or stored on any external server at any stage. The platform integrates a consortium of three lightweight, fine-tuned, and quantized open-source LLMs -- Gemma, Phi-3.5-mini, and Qwen2 -- selected for their compact architectures and proven efficiency on resource-constrained mobile hardware. An on-device orchestration layer coordinates ensemble inference and consensus-based diagnostic reasoning, producing DSM-5-aligned assessments for conditions. The platform is designed to assist clinicians with differential diagnosis and evidence-linked symptom mapping, as well as to support patient-facing self-screening with appropriate clinical safeguards. Initial evaluation demonstrates that the proposed zero-egress deployment achieves diagnostic accuracy comparable to its server-side predecessor while sustaining real-time inference latency on commodity mobile hardware.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a zero-egress on-device AI platform for privacy-preserving psychiatric decision support, implemented as a cross-platform mobile application. It extends prior work on fine-tuned LLM consortiums by deploying an ensemble of three quantized lightweight models (Gemma, Phi-3.5-mini, Qwen2) with an on-device orchestration layer for consensus-based, DSM-5-aligned diagnostic reasoning. The system targets differential diagnosis assistance and patient self-screening in sensitive settings (military, correctional, remote care) while ensuring no patient data leaves the device. The abstract asserts that initial evaluation shows diagnostic accuracy comparable to the server-side predecessor alongside real-time inference latency on commodity mobile hardware.
Significance. If the accuracy and latency claims are substantiated, the work would offer a concrete technical path to address a major adoption barrier for AI in mental healthcare: the privacy risk of data egress in high-stakes environments. Demonstrating a practical, fully local ensemble deployment using open-source models could enable safer clinician tools and self-screening applications without external servers, potentially increasing help-seeking behavior. The emphasis on consensus reasoning and clinical safeguards adds operational relevance beyond pure model compression.
major comments (2)
- [Abstract] Abstract: The central claim that 'initial evaluation demonstrates that the proposed zero-egress deployment achieves diagnostic accuracy comparable to its server-side predecessor' is presented without any quantitative metrics (accuracy, F1, Cohen's kappa), dataset description (size, conditions, ground-truth source), evaluation protocol, baselines, or error analysis. This absence leaves the primary performance assertion unsupported and prevents assessment of whether quantization and mobile constraints degrade DSM-5 diagnostic fidelity.
- [Abstract] Evaluation/Results section (implied by abstract claim): No details are provided on how the on-device ensemble was tested against the server-side predecessor, including any ablation on quantization effects, inter-rater agreement with clinicians, or statistical significance of the 'comparable' result. Without these, the claim that the architecture preserves diagnostic quality cannot be evaluated and is load-bearing for the paper's contribution.
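One standard way to assess the significance of a "comparable accuracy" claim on paired cases is McNemar's test on the discordant pairs. The paper names no test, so this exact binomial form is an assumed choice, not the authors' method:

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value from discordant-pair counts.

    b: cases the server-side model gets right and the on-device model misses;
    c: the reverse. Under the null of equal accuracy, each discordant pair
    favors either system with probability 1/2, so the p-value is a two-sided
    binomial tail probability.
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, nothing to test
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For example, if the server version wins 8 discordant cases and the on-device version wins 3, the exact p-value is about 0.23, so "comparable" would not be rejected at that sample size; this is precisely why the review asks for the counts.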
minor comments (2)
- [Architecture] The description of the orchestration layer and consensus mechanism would benefit from a high-level diagram or pseudocode to clarify how the three models coordinate DSM-5 symptom mapping and differential diagnosis.
- [Introduction] Explicit citation to the prior server-side LLM consortium paper should be added in the introduction to clearly delineate the novel on-device re-architecture from the earlier fine-tuning work.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments correctly identify that the abstract's performance claim requires quantitative substantiation to be evaluable. We will revise the manuscript to add a dedicated Evaluation section with metrics, dataset details, ablations, and analysis, while updating the abstract accordingly. Our point-by-point responses follow.
Point-by-point responses
Referee: [Abstract] Abstract: The central claim that 'initial evaluation demonstrates that the proposed zero-egress deployment achieves diagnostic accuracy comparable to its server-side predecessor' is presented without any quantitative metrics (accuracy, F1, Cohen's kappa), dataset description (size, conditions, ground-truth source), evaluation protocol, baselines, or error analysis. This absence leaves the primary performance assertion unsupported and prevents assessment of whether quantization and mobile constraints degrade DSM-5 diagnostic fidelity.
Authors: We agree that the abstract claim currently lacks supporting numbers and context. In revision we will expand the abstract to report key quantitative results (e.g., accuracy, F1, Cohen's kappa) and will add a new Evaluation section that fully describes the test dataset (size, conditions, ground-truth source), protocol, baselines, and error analysis so readers can assess any impact of quantization and on-device constraints. revision: yes
Referee: [Abstract] Evaluation/Results section (implied by abstract claim): No details are provided on how the on-device ensemble was tested against the server-side predecessor, including any ablation on quantization effects, inter-rater agreement with clinicians, or statistical significance of the 'comparable' result. Without these, the claim that the architecture preserves diagnostic quality cannot be evaluated and is load-bearing for the paper's contribution.
Authors: We accept this assessment. The current manuscript presents only a high-level claim. We will add a full Results/Evaluation section containing ablation studies on quantization, inter-rater agreement (Cohen's kappa) with clinicians and the server-side model, statistical significance tests, and error analysis. These additions will directly substantiate the comparability claim and allow evaluation of diagnostic fidelity preservation. revision: yes
Circularity Check
No significant circularity; architecture and claims are self-contained
full rationale
The paper presents an engineering description of an on-device LLM ensemble for psychiatric decision support, extending prior fine-tuned models via re-architecting for local execution. No mathematical derivations, equations, fitted parameters, or predictions appear in the provided text. The accuracy comparability claim is asserted from 'initial evaluation' without reducing by construction to the inputs or prior work; it is an evidentiary assertion rather than a self-referential loop. Self-citation of the authors' earlier LLM consortium work is present but does not bear load on any derivation chain, as the deployment pipeline stands independently. This matches the default expectation of no circularity for descriptive systems papers.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Lightweight open-source LLMs can be fine-tuned to produce DSM-5-aligned psychiatric assessments
- domain assumption Quantization preserves sufficient diagnostic accuracy for clinical decision support
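The abstract does not state the quantization scheme. As a toy illustration of why the second assumption needs checking, a symmetric 4-bit round-trip on synthetic weights, measuring the worst-case reconstruction error; the scheme and all values are illustrative, not the paper's:

```python
import random

def quantize_int4(weights):
    """Symmetric 4-bit quantization: map floats to integers in [-7, 7] plus a scale."""
    scale = max(abs(w) for w in weights) / 7
    if scale == 0:
        return [0] * len(weights), 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    return [v * scale for v in q]

random.seed(0)
weights = [random.gauss(0.0, 0.5) for _ in range(1000)]
q, scale = quantize_int4(weights)
restored = dequantize(q, scale)
# With per-tensor symmetric rounding, the worst-case round-trip error
# is half a quantization step (scale / 2).
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Bounding per-weight error says nothing by itself about downstream diagnostic fidelity, which is exactly the gap the load-bearing premise asks the evaluation to close.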
Forward citations
Cited by 1 Pith paper
- Self-Prompting Small Language Models for Privacy-Sensitive Clinical Information Extraction: Small open-weight language models can self-optimize prompts for clinical named entity recognition in dental notes, reaching micro F1 of 0.864 after DPO on Qwen2.5-14B.
Reference graph
Works this paper leans on
- [1] World Health Organization, World mental health report: Transforming mental health for all, Tech. rep., WHO, Geneva, Switzerland (2022). https://www.who.int/publications/i/item/9789240049338
- [2] S. Saxena, G. Thornicroft, M. Knapp, H. Whiteford, Resources for mental health: scarcity, inequity, and inefficiency, The Lancet 370 (9590) (2007) 878–889. doi:10.1016/S0140-6736(07)61239-2
- [3] G. Thornicroft, et al., Undertreatment of people with major depressive disorder in 21 countries, The British Journal of Psychiatry 210 (2) (2016) 119–124. doi:10.1192/bjp.bp.116.188078
- [4] C. W. Hoge, et al., Combat duty in Iraq and Afghanistan, mental health problems, and barriers to care, New England Journal of Medicine 351 (1) (2004) 13–22. doi:10.1056/NEJMoa040603
- [5] P. Y. Kim, et al., Stigma, barriers to care, and use of mental health services among active duty and National Guard soldiers after combat, Psychiatric Services 62 (1) (2011) 27–34. doi:10.1176/ps.62.1.pss6201_0027
- [6] T. Greene, et al., Stigma and barriers to mental health treatment in the military, Military Medicine 175 (2) (2010) 86–91. doi:10.7205/MILMED-D-09-00120
- [7] Z. Guo, et al., Automated depression detection using deep learning and natural language processing, ACM Transactions on Computing for Healthcare 1 (3) (2020) 1–19. doi:10.1145/3372168
- [8] M. Shim, et al., Machine learning-based diagnostic models for psychiatric disorders: a systematic review, Journal of Psychiatric Research 133 (2021) 1–12. doi:10.1016/j.jpsychires.2020.12.019
- [10] Gemma Team, Google DeepMind, Gemma: Open models based on Gemini research and technology, arXiv preprint arXiv:2403.08295 (2024). https://arxiv.org/abs/2403.08295
- [11] M. Abdin, et al., Phi-3 technical report: A highly capable language model locally on your phone, arXiv preprint arXiv:2404.14219 (2024). https://arxiv.org/abs/2404.14219
- [12] A. Yang, et al., Qwen2 technical report, arXiv preprint arXiv:2407.10671 (2024). https://arxiv.org/abs/2407.10671
- [13] T. Dettmers, A. Pagnoni, A. Holtzman, L. Zettlemoyer, QLoRA: Efficient finetuning of quantized LLMs, in: Advances in Neural Information Processing Systems (NeurIPS), 2023. https://arxiv.org/abs/2305.14314
- [14] M. Xu, et al., A survey of resource-efficient LLM and multimodal foundation models, arXiv preprint arXiv:2401.08092 (2024). https://arxiv.org/abs/2401.08092
- [15] S. Laskaridis, et al., MELTing point: Mobile evaluation of language transformers, arXiv preprint arXiv:2403.12844 (2024). https://arxiv.org/abs/2403.12844
- [16] American Psychiatric Association, Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5), American Psychiatric Publishing, Arlington, VA, 2013. doi:10.1176/appi.books.9780890425596
- [17] World Health Organization, International classification of diseases, 11th revision (ICD-11), Tech. rep., WHO, Geneva, Switzerland (2019). https://icd.who.int
- [18] D. A. Regier, et al., DSM-5 field trials in the United States and Canada, part II: Test-retest reliability of selected categorical diagnoses, American Journal of Psychiatry 170 (1) (2013) 59–70. doi:10.1176/appi.ajp.2012.12070999
- [19] R. Freedman, et al., The initial field trials of DSM-5: new blooms and old thorns, American Journal of Psychiatry 170 (1) (2013) 1–5. doi:10.1176/appi.ajp.2012.12091189
- [20] K. S. Kendler, An historical framework for psychiatric nosology, Psychological Medicine 39 (12) (2009) 1935–1941. doi:10.1017/S0033291709005753
- [21] R. M. A. Hirschfeld, et al., Perceptions and impact of bipolar disorder: how far have we really come? Results of the National Depressive and Manic-Depressive Association 2000 survey, Journal of Clinical Psychiatry 64 (2) (2003) 161–174. doi:10.4088/JCP.v64n0209
- [22] E. Bandara, R. Gore, A. Yarlagadda, A. H. Clayton, P. Samuel, C. K. Rhea, S. Shetty, Standardization of psychiatric diagnoses – role of fine-tuned LLM consortium and OpenAI GPT-OSS reasoning LLM enabled decision support system, arXiv preprint arXiv:2510.25588 (2025)
- [23] A. Vaswani, et al., Attention is all you need, Advances in Neural Information Processing Systems (NeurIPS) 30 (2017). https://arxiv.org/abs/1706.03762
- [24] T. B. Brown, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems (NeurIPS) 33 (2020) 1877–1901. https://arxiv.org/abs/2005.14165
- [25] R. Gore, E. Bandara, S. Shetty, A. E. Musto, P. Rana, A. Valencia-Romero, C. Rhea, L. Tayebi, H. Richter, A. Yarlagadda, et al., Proof-of-TBI – fine-tuned vision language model consortium and OpenAI-o3 reasoning LLM-based medical diagnosis support system for mild traumatic brain injury (TBI) prediction, arXiv preprint arXiv:2504.18671 (2025)
- [26] M. Gaur, et al., Characterization of time-variant and time-invariant assessment of suicidality on Reddit using C-SSRS, PLOS ONE 16 (5) (2021) e0250448. doi:10.1371/journal.pone.0250448
- [27] N. Flemotomos, et al., Automated quality assessment of cognitive behavioral therapy sessions through extracting psycholinguistic features, in: Proceedings of Interspeech, 2021, pp. 4251–4255. doi:10.21437/Interspeech.2021-357
- [28] I. Y. Chen, et al., Ethical machine learning in healthcare, Annual Review of Biomedical Data Science 4 (2021) 123–144. doi:10.1146/annurev-biodatasci-092820-114757
- [29] E. Bandara, A. Hass, S. Shetty, R. Mukkamala, R. Gore, A. Rahman, S. H. Bouk, Deep-stride: Automated security threat modeling with vision-language models, in: 2025 International Conference on Software, Telecommunications and Computer Networks (SoftCOM), 2025, pp. 1–7
- [30] GGML Contributors, GGUF: GPT-generated unified format (2023). https://github.com/ggerganov/ggml
- [31] G. Gerganov, llama.cpp: LLM inference in C/C++ (2023). https://github.com/ggerganov/llama.cpp
- [32] MLC Team, MLC LLM: Universal LLM deployment engine (2023). https://github.com/mlc-ai/mlc-llm
- [33] United States Congress, Health Insurance Portability and Accountability Act of 1996 (HIPAA), Public Law 104-191, United States Department of Health and Human Services, Washington, DC (1996)
- [34] European Parliament and Council, General Data Protection Regulation (GDPR), Regulation (EU) 2016/679, Official Journal of the European Union (2016). https://gdpr-info.eu
- [35] U.S. Department of Defense, DoD Instruction 8582.01: Privacy in the DoD (2012). https://www.esd.whs.mil/DD/
- [36] U.S. General Services Administration, FedRAMP: Federal Risk and Authorization Management Program (2011). https://www.fedramp.gov
- [37] B. Blobel, et al., Trustworthy, secure and privacy-protecting electronic health record systems, Methods of Information in Medicine 57 (2018) e47–e57. doi:10.3414/ME17-01-0048
- [38] P. S. Appelbaum, Privacy in psychiatric treatment: threats and responses, American Journal of Psychiatry 159 (11) (2015) 1809–1818. doi:10.1176/appi.ajp.159.11.1809
- [39] N. Rieke, et al., The future of digital health with federated learning, npj Digital Medicine 3 (1) (2020) 119. doi:10.1038/s41746-020-00323-1
- [40] J. C. Duchi, M. I. Jordan, M. J. Wainwright, Local privacy and statistical minimax rates, IEEE Symposium on Foundations of Computer Science (FOCS) (2013) 429–438. doi:10.1109/FOCS.2013.53
- [41] E. Bandara, A. Hass, R. Gore, S. Shetty, R. Mukkamala, S. H. Bouk, X. Liang, N. W. Keong, K. De Zoysa, A. Withanage, et al., Astride: A security threat modeling platform for agentic-AI applications, arXiv preprint arXiv:2512.04785 (2025)
- [42] Unsloth Contributors, Unsloth: Fast and memory-efficient LLM fine-tuning (2024). https://github.com/unslothai/unsloth
- [43] D. B. Acharya, K. Kuppan, B. Divya, Agentic AI: Autonomous intelligence for complex goals – a comprehensive survey, IEEE Access (2025)
- [44] E. Bandara, R. Gore, P. Foytik, S. Shetty, R. Mukkamala, A. Rahman, X. Liang, S. H. Bouk, A. Hass, S. Rajapakse, et al., A practical guide for designing, developing, and deploying production-grade agentic AI workflows, arXiv preprint arXiv:2512.08769 (2025)
- [45] A. Yehudai, L. Eden, A. Li, G. Uziel, Y. Zhao, R. Bar-Haim, A. Cohan, M. Shmueli-Scheuer, Survey on evaluation of LLM-based agents, arXiv preprint arXiv:2503.16416 (2025)
- [46] E. Bandara, R. Gore, X. Liang, S. Rajapakse, I. Kularathne, P. Karunarathna, P. Foytik, S. Shetty, R. Mukkamala, A. Rahman, et al., Agentsway – software development methodology for AI agents-based teams, arXiv preprint arXiv:2510.23664 (2025)
- [47] E. Bandara, T. Hewa, R. Gore, S. Shetty, R. Mukkamala, P. Foytik, A. Rahman, S. H. Bouk, X. Liang, A. Hass, et al., Towards responsible and explainable AI agents with consensus-driven reasoning, arXiv preprint arXiv:2512.21699 (2025)
- [48] E. Bandara, R. Gore, S. Shetty, S. Rajapakse, I. Kularathne, P. Karunarathna, R. Mukkamala, P. Foytik, S. H. Bouk, A. Rahman, et al., A practical guide to agentic AI transition in organizations, arXiv preprint arXiv:2602.10122 (2026)
- [49] K. Kroenke, R. L. Spitzer, J. B. W. Williams, The PHQ-9: Validity of a brief depression severity measure, Journal of General Internal Medicine 16 (9) (2001) 606–613. doi:10.1046/j.1525-1497.2001.016009606.x
- [50] F. W. Weathers, et al., PTSD checklist for DSM-5 (PCL-5), Tech. rep., National Center for PTSD (2013). https://www.ptsd.va.gov/professional/assessment/adult-sr/ptsd-checklist.asp
- [51] ARM Ltd., ARM TrustZone technology (2023). https://developer.arm.com/ip-products/security-ip/trustzone
- [52] Apple Inc., Apple platform security: Secure Enclave (2023). https://support.apple.com/guide/security/secure-enclave-sec59b0b31ff/web
- [53] E. Bandara, mental-reasoning: A psychiatric diagnostic conversational dataset for DSM-5 aligned LLM fine-tuning (2025). https://huggingface.co/datasets/lambdaeranga/mental-reasoning
- [54] E. J. Hu, et al., LoRA: Low-rank adaptation of large language models, arXiv preprint arXiv:2106.09685 (2021). https://arxiv.org/abs/2106.09685
- [55] E. Bandara, R. Gore, S. Shetty, R. Mukkamala, C. Rhea, A. Yarlagadda, S. Kaushik, L. De Silva, A. Maznychenko, I. Sokolowska, et al., Standardization of neuromuscular reflex analysis – role of fine-tuned vision-language model consortium and OpenAI GPT-OSS reasoning LLM enabled decision support system, arXiv preprint arXiv:2508.12473 (2025)
- [56] R. L. Spitzer, K. Kroenke, J. B. W. Williams, B. Löwe, A brief measure for assessing generalized anxiety disorder: the GAD-7, Archives of Internal Medicine 166 (10) (2006) 1092–1097. doi:10.1001/archinte.166.10.1092
- [57] R. M. A. Hirschfeld, et al., Development and validation of a screening instrument for bipolar spectrum disorder: the Mood Disorder Questionnaire, American Journal of Psychiatry 157 (11) (2000) 1873–1875. doi:10.1176/appi.ajp.157.11.1873
- [58] S. R. Kay, A. Fiszbein, L. A. Opler, The positive and negative syndrome scale (PANSS) for schizophrenia, Schizophrenia Bulletin 13 (2) (1987) 261–276. doi:10.1093/schbul/13.2.261
- [59] E. J. Topol, Deep Medicine: How Artificial Intelligence Can Make Healthcare Human Again, Basic Books, New York, NY, 2019
- [60] Suicide Prevention Resource Center, Safe messaging guidelines for suicide and mental health (2022). https://www.sprc.org/resources-programs/safe-messaging-guidelines
- [61] Y. Kim, et al., Promises and pitfalls of large language models in psychiatric diagnosis and knowledge tasks, The British Journal of Psychiatry (2024). doi:10.1192/bjp.2024.83
- [62] K. Singhal, et al., Large language models encode clinical knowledge, Nature 620 (2023) 172–180. doi:10.1038/s41586-023-06291-2
- [63]
- [64] others, MHINDR: A DSM-5 based mental health diagnosis and recommendation framework using LLM, arXiv preprint arXiv:2509.25992 (2025). https://arxiv.org/abs/2509.25992
- [65] O. Golan, et al., LLM questionnaire completion for automatic psychiatric assessment, in: Findings of EMNLP, 2024. doi:10.18653/v1/2024.findings-emnlp.23
- [66] others, Trustworthy AI psychotherapy: Multi-agent LLM workflow for counseling and explainable mental disorder diagnosis, arXiv preprint arXiv:2508.11398 (2025). https://arxiv.org/abs/2508.11398
- [67]
- [68] others, FedMentor: Domain-aware differential privacy for heterogeneous federated LLMs in mental health, arXiv preprint arXiv:2509.14275 (2025). https://arxiv.org/abs/2509.14275
- [70] S. Pati, et al., Privacy preservation for federated learning in health care, Patterns 5 (7) (2024). doi:10.1016/j.patter.2024.100974
- [71] others, Are we there yet? A measurement study of efficiency for LLM applications on mobile devices, arXiv preprint arXiv:2504.00002 (2025). https://arxiv.org/abs/2504.00002
- [72] B. Yang, et al., DRHouse: An LLM-empowered diagnostic reasoning system through harnessing outcomes from sensor data and expert knowledge, in: Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, Vol. 8, 2024, pp. 1–29. doi:10.1145/3699771
- [73] others, Systematic review of large language models in mental health care, JMIR Mental Health 12 (2025) e78410. doi:10.2196/78410
- [74] others, The evolving field of digital mental health: current evidence and implementation issues for smartphone apps, generative artificial intelligence, and virtual reality, World Psychiatry (2025). doi:10.1002/wps.21307