Coding-Free and Privacy-Preserving Agentic Framework for Data-Driven Clinical Research
Pith reviewed 2026-05-10 16:05 UTC · model grok-4.3
The pith
CARIS uses language models to automate clinical research from planning to reports without any coding or direct data access.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CARIS integrates large language models with modular tools via the Model Context Protocol to execute end-to-end clinical research workflows. Across three heterogeneous datasets it completed research planning and IRB documentation within four iterations, supported Vibe ML, and produced reports rated 96% complete by LLM-based evaluation and 82% complete by human evaluation, all while preserving privacy by exposing only outputs to users.
What carries the argument
The Model Context Protocol (MCP) that links LLMs to specialized tools for tasks such as cohort construction and IRB drafting, allowing the entire workflow to be driven by natural language while keeping patient data private.
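The separation this describes can be made concrete. Below is a minimal sketch (our illustration, not the authors' implementation; the tool name and fields are hypothetical) of the MCP-style boundary: tools run inside the secure data environment, and only derived outputs cross back to the LLM.

```python
# Hypothetical sketch of the output-only boundary CARIS relies on.
# Tools execute next to the data store; the model never sees raw rows.

def build_cohort(criteria: dict) -> dict:
    """Hypothetical tool: would query the EHR store; stubbed here."""
    matched = 128  # stub aggregate; no patient rows leave this function
    return {"criteria": criteria, "n_patients": matched}

TOOLS = {"build_cohort": build_cohort}

def handle_tool_call(name: str, args: dict) -> dict:
    """Host-side dispatch; enforces the output-only privacy invariant."""
    result = TOOLS[name](args)
    if "rows" in result:  # record-level fields must never be returned
        raise ValueError("privacy invariant violated")
    return result
```

The key design point is that the invariant is enforced at the dispatch layer, not left to the model's discretion.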
If this is right
- Clinicians without programming skills can generate IRB-ready documentation and cohort definitions directly from study questions.
- Research teams can iterate on analysis plans while keeping raw patient records inside secure environments.
- Report generation becomes a repeatable output of the same agent loop rather than a separate manual step.
- The same framework can be applied to both public and private datasets without changing the user interface.
Where Pith is reading between the lines
- If the four-iteration bound holds across more institutions, the time from question to IRB submission could drop from weeks to days for many observational studies.
- The privacy design opens the possibility of running the agent on federated hospital data without moving records to a central server.
- Extending the tool set to include statistical power calculations or regulatory checklist completion could further compress the pre-study phase.
- Human oversight remains essential, so the framework may be most useful as a co-pilot rather than a fully autonomous researcher.
Load-bearing premise
Large language models connected through MCP can reliably carry out clinical tasks such as cohort construction and IRB documentation with only the small number of human corrections reported.
What would settle it
Running CARIS on a fresh clinical dataset where planning or IRB documents require more than four rounds of correction or contain critical factual errors that human reviewers must rewrite.
Figures
Original abstract
Clinical data-driven research requires clinical expertise, programming skills, access to patient data, and extensive documentation, creating barriers and slowing the pace for clinicians and external researchers. To address this, we developed the Clinical Agentic Research Intelligence System (CARIS) that automates the workflow: research planning, literature search, cohort construction, Institutional Review Board (IRB) documentation, Vibe Machine Learning (ML), and report generation, with human-in-the-loop refinement. CARIS integrates Large Language Models (LLMs) with modular tools through the Model Context Protocol (MCP), enabling natural language-driven research without coding while allowing users to access only outputs. We evaluated CARIS on three heterogeneous datasets with distinct clinical tasks, where it completed planning and IRB documentation within four iterations, supported Vibe ML, and generated reports, achieving 96% completeness in LLM-based evaluation and 82% in human evaluation. CARIS demonstrates potential to reduce documentation burden and technical barriers, accelerating data-driven clinical research across public and private data environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CARIS (Clinical Agentic Research Intelligence System), an LLM-based agentic framework using the Model Context Protocol (MCP) to automate end-to-end data-driven clinical research workflows—research planning, literature search, cohort construction, IRB documentation, Vibe ML, and report generation—without requiring user coding while preserving privacy by exposing only outputs. Evaluated on three heterogeneous datasets with distinct clinical tasks, the system is reported to complete planning and IRB documentation in four human-in-the-loop iterations, support Vibe ML, and generate reports, achieving 96% completeness under LLM-based evaluation and 82% under human evaluation.
Significance. If the performance and safety claims are substantiated, CARIS could meaningfully reduce technical and documentation barriers for clinicians and external researchers working with sensitive clinical data, enabling faster iteration on cohort definition and regulatory documentation across public and private environments. The modular MCP integration for tool use without code exposure offers a practical template for privacy-preserving agentic systems in regulated domains.
major comments (3)
- [Abstract / Evaluation] Evaluation methodology (abstract and results): The central performance claims rest on 96% LLM-based and 82% human completeness scores, yet no definition of 'completeness,' error typology (e.g., factual inaccuracies in IRB text or incorrect cohort logic), baseline comparisons, or inter-rater reliability is provided. This leaves the metrics vulnerable to superficial coverage rather than verified clinical accuracy or safety.
- [Results] Human evaluation protocol: The 82% human score is reported without specifying the number or expertise of raters, blinding procedures, ground-truth references for the three datasets, or breakdown of error types (e.g., medical logic errors vs. formatting issues), which is required to support the claim that outputs are usable after only four iterations.
- [Evaluation] Generalizability across datasets: While three heterogeneous datasets are mentioned, no per-dataset or per-task performance breakdown is given, nor any analysis of failure modes on varying data schemas or clinical domains, undermining the assertion of broad applicability.
minor comments (2)
- [Abstract] The acronym 'Vibe ML' is used without an explicit expansion or reference on first use in the abstract.
- [Methods] The description of MCP integration would benefit from a brief diagram or pseudocode showing the tool-calling loop to clarify how privacy is enforced at the protocol level.
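The tool-calling loop the second minor comment asks to see is small enough to sketch here (a hedged reconstruction from the abstract's description, not the authors' code; function names are ours):

```python
# Minimal agent-loop sketch: the model proposes tool calls, the host
# executes them inside the secure environment, and only serialized
# outputs re-enter the transcript, up to a fixed iteration budget.

def run_agent(ask_model, call_tool, question, max_iters=4):
    transcript = [{"role": "user", "content": question}]
    for _ in range(max_iters):
        step = ask_model(transcript)          # model decides next action
        if step["type"] == "final":
            return step["content"]            # only outputs reach the user
        output = call_tool(step["tool"], step["args"])  # enclave side
        transcript.append({"role": "tool", "content": repr(output)})
    return None  # budget exhausted: escalate to human-in-the-loop review
```

A diagram or snippet of this shape in the Methods would make explicit where the privacy boundary sits relative to the model.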
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. The comments highlight important opportunities to strengthen the transparency and rigor of the evaluation methodology, and we address each major comment point by point below. We will incorporate revisions to address the identified gaps.
Point-by-point responses
-
Referee: [Abstract / Evaluation] Evaluation methodology (abstract and results): The central performance claims rest on 96% LLM-based and 82% human completeness scores, yet no definition of 'completeness,' error typology (e.g., factual inaccuracies in IRB text or incorrect cohort logic), baseline comparisons, or inter-rater reliability is provided. This leaves the metrics vulnerable to superficial coverage rather than verified clinical accuracy or safety.
Authors: We agree that the evaluation methodology requires greater specificity. The submitted manuscript reports the aggregate completeness scores but does not define 'completeness,' provide an error typology, include baseline comparisons, or report inter-rater reliability. In the revised version we will add an explicit definition of completeness (proportion of workflow components generated without critical factual or logical errors), a categorized error typology, a discussion of baseline limitations given the novelty of the end-to-end task, and inter-rater reliability statistics to better substantiate claims regarding clinical accuracy and safety. revision: yes
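The promised definition and reliability statistic can be pinned down in a few lines. A sketch of our reading of the revision plan (not released code), with Cohen's kappa as one standard inter-rater statistic for a binary critical/no-critical labeling:

```python
# completeness = share of workflow components free of critical errors;
# Cohen's kappa = chance-corrected agreement between two raters.

def completeness(labels):
    """labels: True if a component has no critical error, else False."""
    return sum(labels) / len(labels)

def cohens_kappa(a, b):
    """a, b: parallel boolean labelings of the same components."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n        # observed agreement
    p_e = (sum(a) / n) * (sum(b) / n) \
        + (1 - sum(a) / n) * (1 - sum(b) / n)          # chance agreement
    return (p_o - p_e) / (1 - p_e)
```

Under this definition, 41 clean components out of 50 would reproduce the reported 82% human score.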
-
Referee: [Results] Human evaluation protocol: The 82% human score is reported without specifying the number or expertise of raters, blinding procedures, ground-truth references for the three datasets, or breakdown of error types (e.g., medical logic errors vs. formatting issues), which is required to support the claim that outputs are usable after only four iterations.
Authors: We acknowledge that the human evaluation protocol details are not described in the current manuscript. We will revise the Results section to specify the number and expertise of raters, blinding procedures, how ground-truth references were constructed for each dataset, and a breakdown of error types. These additions will provide the necessary context to evaluate the 82% score and the usability claim after four iterations. revision: yes
-
Referee: [Evaluation] Generalizability across datasets: While three heterogeneous datasets are mentioned, no per-dataset or per-task performance breakdown is given, nor any analysis of failure modes on varying data schemas or clinical domains, undermining the assertion of broad applicability.
Authors: We agree that the absence of per-dataset and per-task breakdowns, together with failure-mode analysis, limits the strength of the generalizability claim. The manuscript currently reports only aggregate results across the three datasets. In the revision we will add a table with performance metrics disaggregated by dataset and task, plus a dedicated analysis of observed failure modes related to data schemas and clinical domains. revision: yes
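The promised disaggregation is a small computation once component-level labels exist; a hedged sketch (record fields are hypothetical, not the authors' schema):

```python
from collections import defaultdict

# Per-dataset, per-task completeness from (dataset, task, ok) records.

def disaggregate(records):
    """records: iterable of (dataset, task, ok_bool) tuples."""
    cells = defaultdict(list)
    for dataset, task, ok in records:
        cells[(dataset, task)].append(ok)
    return {cell: sum(oks) / len(oks) for cell, oks in cells.items()}
```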
Circularity Check
No circularity: empirical system description without derivations or self-referential predictions
Full rationale
The paper presents CARIS as an LLM-integrated agentic framework for clinical research workflows and reports empirical completeness metrics (96% LLM-based, 82% human) on three datasets. No equations, first-principles derivations, fitted parameters, or predictions appear in the provided text or abstract. Claims rest on iterative human-in-the-loop execution and external evaluations rather than any chain that reduces to its own inputs by construction. Self-citations, if present, are not load-bearing for any mathematical result.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Large language models can be safely and accurately guided through clinical research workflows using modular tools and human-in-the-loop refinement.
invented entities (2)
- CARIS: no independent evidence
- Vibe ML: no independent evidence
Reference graph
Works this paper leans on
- [1] Watson J, Hutyra CA, Clancy SM, et al. Overcoming barriers to the adoption and implementation of predictive modeling and machine learning in clinical care: what can we learn from US academic medical centers? JAMIA Open 2020;3:167–172. DOI: 10.1093/jamiaopen/ooz046
- [2] Thirunavukarasu AJ, Elangovan K, Gutierrez L, et al. Democratizing artificial intelligence imaging analysis with automated machine learning: tutorial. J Med Internet Res 2023;25:e49949. DOI: 10.2196/49949
- [3] Achiam J, Adler S, Agarwal S, et al. GPT-4 technical report. March 15, 2023 (https://arxiv.org/abs/2303.08774). Preprint
- [4] Chowdhery A, Narang S, Devlin J, et al. PaLM: Scaling language modeling with pathways. J Mach Learn Res 2023;24:1–113 (http://jmlr.org/papers/v24/22-1144.html)
- [5] Meta AI. The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation. April 5, 2025 (https://ai.meta.com/blog/llama-4-multimodal-intelligence/)
- [6] Gemma Team. Gemma 3 technical report. March 25, 2025 (https://arxiv.org/abs/2503.19786). Preprint
- [7] Wang L, Ma C, Feng X, et al. A survey on large language model based autonomous agents. Front Comput Sci 2024;18:186345. DOI: 10.1007/s11704-024-40231-1
- [8] Li X, Wang S, Zeng S, Wu Y, Yang Y. A survey on LLM-based multi-agent systems: workflow, infrastructure, and challenges. Vicinagearth 2024;1:9. DOI: 10.1007/s44336-024-00009-2
- [9] Zhao B, Foo LG, Hu P, Theobalt C, Rahmani H, Liu J. LLM-based agentic reasoning frameworks: A survey from methods to scenarios. August 25, 2025 (https://arxiv.org/abs/2508.17692). Preprint
- [10] Wu Q, Bansal G, Zhang J, et al. AutoGen: Enabling next-gen LLM applications via multi-agent conversations. In: First Conference on Language Modeling (COLM). 2024 (https://openreview.net/forum?id=BAakY1hNKS)
- [11] Tran KT, Dao D, Nguyen MD, Pham QV, O'Sullivan B, Nguyen HD. Multi-agent collaboration mechanisms: A survey of LLMs. January 10, 2025 (https://arxiv.org/abs/2501.06322). Preprint
- [12] Baek J, Jauhar SK, Cucerzan S, Hwang SJ. ResearchAgent: Iterative research idea generation over scientific literature with large language models. In: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). ACL: 2025:6709–6738 (https://...
- [13] Guo S, Deng C, Wen Y, Chen H, Chang Y, Wang J. DS-Agent: Automated data science by empowering large language models with case-based reasoning. February 27, 2024 (https://arxiv.org/abs/2402.17453). Preprint
- [14] Gridach M, Nanavati J, Abidine KZE, Mendes L, Mack C. Agentic AI for scientific discovery: A survey of progress, challenges, and future directions. March 12, 2025 (https://arxiv.org/abs/2503.08979). Preprint
- [15] Swanson K, Wu W, Bulaong NL, Pak JE, Zou J. The virtual lab: AI agents design new SARS-CoV-2 nanobodies with experimental validation. November 11, 2024 (https://www.biorxiv.org/content/10.1101/2024.11.11.623004v1). Preprint
- [16] Artsi Y, Sorin V, Glicksberg BS, Korfiatis P, Nadkarni GN, Klang E. Large language models in real-world clinical workflows: a systematic review of applications and implementation. Front Digit Health 2025;7:1659134. DOI: 10.3389/fdgth.2025.1659134
- [17] Hassan M, Kushniruk A, Borycki E. Barriers to and facilitators of artificial intelligence adoption in health care: scoping review. JMIR Hum Factors 2024;11:e48633. DOI: 10.2196/48633
- [18] Garcia B, Hogarth M, Wang Y, Zhu X, Tu SP. Multi-site research using electronic health record data: Lessons learned from a case study. Learn Health Syst 2025;9:e70039. DOI: 10.1002/lrh2.10439
- [19] Lee M, Kim K, Shin Y, Lee Y, Kim TJ. Advancements in electronic medical records for clinical trials: Enhancing data management and research efficiency. Cancers (Basel) 2025;17:1552. DOI: 10.3390/cancers17091552
- [20] Holmes JH, Beinlich J, Boland MR, et al. Why is the electronic health record so challenging for research and clinical care? Methods Inf Med 2021;60:032–048. DOI: 10.1055/s-0041-1731784
- [21] Ehtesham A, Singh A, Kumar S. Enhancing clinical decision support and EHR insights through LLMs and the model context protocol: An open-source MCP-FHIR framework. June 18, 2025 (https://ieeexplore.ieee.org/abstract/document/11105280). Preprint
- [22] Singh A, Ehtesham A, Kumar S, Khoei TT. A survey of the model context protocol (MCP): Standardizing context to enhance large language models (LLMs). April 3, 2025 (https://www.preprints.org/frontend/manuscript/b45407370ad06ed48b5ebc462c1d8a2c/download pub). Preprint
- [23] Attrach RA, Moreira P, Fani R, Umeton R, Celi LA. Conversational LLMs simplify secure clinical data access, understanding, and analysis. July 1, 2025 (https://doi.org/10.48550/arXiv.2507.01053). Preprint
- [24] Collins GS, Moons KG, Dhiman P, et al. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ 2024;385. DOI: 10.1136/bmj-2023-078378
- [25] Haynes RB, Sackett DL, Richardson WS, Rosenberg W, Langley GR. Evidence-based medicine: How to practice & teach EBM. Canadian Medical Association Journal 1997;157:788
- [26] Fiorini N, Canese K, Starchenko G, et al. Best match: new relevance search for PubMed. PLoS Biol 2018;16:e2005343. DOI: 10.1371/journal.pbio.2005343
- [27] Lundberg SM, Lee SI. A unified approach to interpreting model predictions. In: Advances in Neural Information Processing Systems. 2017:4765–4774 (https://proceedings.neurips.cc/paper/2017/hash/8a20a8621978632d76c43dfd28b67767-Abstract.html)
- [28] Johnson AE, Bulgarelli L, Shen L, et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci Data 2023;10:1. DOI: 10.1038/s41597-022-01899-x
- [29] Johnson A, Bulgarelli L, Pollard T, et al. MIMIC-IV. October 11, 2024 (https://physionet.org/content/mimiciv/3.1/)
- [30] Lim L, Lee H, Jung CW, et al. INSPIRE, a publicly available research dataset for perioperative medicine. Sci Data 2024;11:655. DOI: 10.1038/s41597-024-03517-4
- [31] Lim L, Lee HC. INSPIRE, a publicly available research dataset for perioperative medicine. August 11, 2024 (https://physionet.org/content/inspire/1.3/)
- [32] Walonoski J, Kramer M, Nichols J, et al. SyntheticMass data, version 2. May 24, 2017 (https://synthea.mitre.org/downloads)
- [33] de Sá AG, Gould D, Fedyukova A, et al. Explainable machine learning for ICU readmission prediction. September 13, 2024 (https://arxiv.org/abs/2309.13781). Preprint
- [34] Sun R, Li S, Wei Y, et al. Development of interpretable machine learning models for prediction of acute kidney injury after noncardiac surgery: a retrospective cohort study. Int J Surg 2024;110:2950–2962. DOI: 10.1097/JS9.0000000000001237
- [35] Luo J, Hu D, Han R, et al. Risk stratification at prediabetes onset and association with diabetes outcomes using EHR data. NPJ Metab Health Dis 2025;3:48. DOI: 10.1038/s44324-025-00091-0
- [36] Sellergren A, Kazemzadeh S, Jaroensri T, et al. MedGemma technical report. July 7, 2025 (https://arxiv.org/abs/2507.05201). Preprint
Supplementary Appendix
Supplementary Note 1. Orchestration prompt. The LLM orchestrates the workflow by interpreting user input and mapping it to appropriate tool invocations, guided by the prompt below, along with the a...
The orchestration prompt instructs the agent to:
- Take initiative: When given a research question, outline the full approach (data source -> cohort selection -> variables -> analysis plan -> expected outputs) before asking for confirmation
- Be thorough: Run exploratory queries to understand the data before jumping to analysis. Check distributions, missing values, and sample sizes
- Document everything: Save intermediate results as CSV files, produce well-structured Word reports, and keep a clear audit trail
- Communicate like a researcher: Use precise terminology, cite statistical rationale, and present results with context (confidence intervals, effect sizes, limitations)
- Chain tools effectively: Combine database queries, file operations, ML pipelines, and document generation in a single workflow when appropriate. Present all results clearly with proper formatting
Supplementary Note 2. Tool descriptions. Table S1 provides a complete list and descriptions of key tools available to the agents. Table S1: Key Tools for Clin...
The topic refinement prompt constrains the model to:
- Refine into a noun-phrase style topic
- Remove extra background details and verbose wording
- Keep essential domain terminology
- Provide one English version
and to respond in English only, in the JSON format {"topic_refined": "Refined research topic in English", "topic_en": "Refined research topic in English"}.
The prompt used to generate questions to draft the research plan requires:
- Generate exactly 12 questions total
- Cover all four PIMO categories (P, I, M, O) at least once
- Questions must not overlap in intent
- Each question must include target_section and pimo_category
- Keep each question specific and answerable by a user
responding in English only, in the JSON format {"questions": [{"question_id": "Q1", "target_section": "research_purpose", "pimo_category": "P", "question": "Which prediabetic patient population is the primary target, and wha...
The per-section drafting prompts share common constraints (define all abbreviations and technical terms on first use; write so that non-experts can follow; respond in English only, as a single JSON-keyed paragraph named after the section) and add section-specific rules:
- research_purpose: Clearly explain why the research is needed and what it aims to solve; include background context, core objective, and hypothesis direction; do not include implementation-level methods; do not repeat claims or wording from previous sections
- research_design: Describe study type and structural framework; include cohort/group structure, comparison setup, and timeline scope; also address informed consent procedures (or justification for waiver), potential risks to participants and mitigation strategies, and applicable ethical guidelines (e.g., Declaration of Helsinki, IRB policies); minimize repetition of purpose rationale or analysis details; do not overlap with previous sections
- research_method: Describe subject criteria, data collection, and variable definition; include privacy protections (de-identification methods, access controls), data storage location, security measures, retention and deletion policies, and quality control procedures; focus on operational workflow, not high-level design claims; avoid repeated statements from previous sections
- validity_evaluation: Specify validation strategy, statistical analyses, and performance metrics; include confounder control, sensitivity analysis, and reproducibility checks; keep it concrete and methodologically explicit; do not repeat previous sections
- expected_effects: Cover realistic academic, clinical, and practical impacts; avoid overly optimistic language; disclose potential conflicts of interest if applicable; focus on significance and applicability, not detailed result patterns; write one academic paragraph; avoid overlap with previous sections
- anticipated results: Describe expected trends and directional outcomes for key endpoints; differentiate primary and secondary outcomes when relevant; keep this section outcome-focused, not significance-focused
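The question-generation constraints quoted above are mechanically checkable; a minimal sketch (hypothetical validator, assuming the JSON shape shown in the prompt):

```python
import json

# Validates a question-generation response against the prompt's rules:
# exactly 12 questions, all four PIMO categories covered at least once,
# and each item carrying question_id and target_section.

def validate_questions(payload: str) -> list:
    problems = []
    qs = json.loads(payload)["questions"]
    if len(qs) != 12:
        problems.append(f"expected 12 questions, got {len(qs)}")
    missing = {"P", "I", "M", "O"} - {q.get("pimo_category") for q in qs}
    if missing:
        problems.append(f"missing PIMO categories: {sorted(missing)}")
    for q in qs:
        if not (q.get("question_id") and q.get("target_section")):
            problems.append(f"incomplete item: {q}")
    return problems
```

A check like this could gate each agent iteration before the draft reaches the human reviewer.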