Pith · machine review for the scientific record

arxiv: 2605.03476 · v1 · submitted 2026-05-05 · 💻 cs.CL · cs.AI

Recognition: unknown

CuraView: A Multi-Agent Framework for Medical Hallucination Detection with GraphRAG-Enhanced Knowledge Verification


Pith reviewed 2026-05-07 04:01 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords: hallucination detection · discharge summaries · GraphRAG · electronic health records · multi-agent framework · faithfulness verification · clinical documentation · medical NLP

The pith

CuraView detects faithfulness hallucinations in discharge summaries by verifying sentences against a GraphRAG knowledge graph built from EHRs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CuraView as a way to catch when large language models insert incorrect statements into medical discharge summaries that contradict the patient's electronic health records. It does this by first turning the records into a structured knowledge graph using GraphRAG, then running a multi-agent process that grades each generated sentence for how well it matches the graph, from strong support to outright contradiction. A sympathetic reader would care because such hallucinations can directly harm patients if they lead to wrong treatment decisions, and the system also generates labeled data that can train better models in the future. The evaluation shows the approach yields structured evidence chains that make verification more reliable than prior methods.

Core claim

CuraView constructs a GraphRAG-based knowledge graph from patient-level EHRs and implements a closed-loop generation-detection pipeline with sentence-level evidence retrieval and classification spanning four evidence grades from strong support to direct contradiction (E1-E4). The fine-tuned detection model achieves strong performance on safety-critical contradiction detection and outperforms several baseline approaches.

What carries the argument

The GraphRAG-constructed knowledge graph from EHRs, paired with a closed-loop multi-agent pipeline that retrieves evidence and classifies sentences into E1-E4 grades for interpretable verification.
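The E1-E4 grading step can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the real pipeline uses LLM agents retrieving from a GraphRAG index, whereas here exact matching over (subject, relation, object) triples stands in for both retrieval and classification, and the `Triple` type and grade boundaries are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    """Hypothetical fact representation; the paper's graph schema may differ."""
    subject: str
    relation: str
    obj: str

def grade_sentence(claims: set[Triple], graph: set[Triple]) -> str:
    """Assign an evidence grade to the claims extracted from one sentence.

    E1: every claim is present in the graph (strong support)
    E2: some claims supported, none contradicted (partial support)
    E3: no supporting evidence found (unverifiable)
    E4: a claim matches a graph triple on subject and relation but
        disagrees on the object (direct contradiction)
    """
    contradicted = any(
        t.subject == c.subject and t.relation == c.relation and t.obj != c.obj
        for c in claims for t in graph
    )
    if contradicted:
        return "E4"
    supported = claims & graph
    if claims and supported == claims:
        return "E1"
    if supported:
        return "E2"
    return "E3"

graph = {Triple("patient", "allergy", "penicillin"),
         Triple("patient", "discharge_med", "metoprolol")}
# The sentence claims a sulfa allergy: subject and relation match a graph
# triple but the object differs, so the sentence is graded E4.
print(grade_sentence({Triple("patient", "allergy", "sulfa")}, graph))  # E4
```

The contradiction check runs before the support check, mirroring the safety-critical priority the paper gives E4.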

If this is right

  • Substantially improves the factual reliability of clinical documentation generated by LLMs.
  • Produces reusable annotated datasets for training and distilling downstream models.
  • Outperforms RAGTruth-style and QAGS-style baselines on hallucination detection tasks.
  • Provides evidence chains that make the detection process more interpretable for clinicians.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar graph-based verification could apply to other clinical documents like progress notes or radiology reports.
  • The annotated outputs might support automated quality assurance pipelines in hospitals.
  • Further work could test the framework on live EHR systems to measure real-world impact on documentation time.
  • The approach might generalize to non-medical domains where source documents are long and structured.

Load-bearing premise

The GraphRAG knowledge graph built from the EHRs accurately captures every piece of information needed to judge whether each sentence in the discharge summary is supported or contradicted.
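This premise could be stress-tested directly with a fact-coverage audit: hand-extract a small gold set of facts from a few EHRs and measure what fraction survives graph construction. A minimal sketch, assuming a simple subject-relation-object fact representation that is not the paper's actual schema:

```python
def fact_recall(gold_facts: set[tuple], graph_facts: set[tuple]) -> float:
    """Fraction of gold EHR facts recoverable from the knowledge graph."""
    if not gold_facts:
        return 1.0
    return len(gold_facts & graph_facts) / len(gold_facts)

# Illustrative gold facts, including a negation, which extraction
# pipelines commonly drop.
gold = {("patient", "diagnosis", "heart failure"),
        ("patient", "lab", "potassium 5.8"),
        ("patient", "negation", "no chest pain")}
graph = {("patient", "diagnosis", "heart failure"),
         ("patient", "lab", "potassium 5.8")}
# The negation was lost during graph construction: recall = 2/3.
print(round(fact_recall(gold, graph), 3))  # 0.667
```

Any recall below 1.0 on such an audit means some summary sentences are necessarily graded against an incomplete graph, which is exactly the failure mode the premise rules out.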

What would settle it

Finding a discharge summary sentence that contradicts the EHR but receives an incorrect grade because the knowledge graph omitted or misstructured a key fact from the records.

Figures

Figures reproduced from arXiv: 2605.03476 by Dongsuk Oh, Guangsu Yan, Severin Ye, Xiao Kong, Xiaopeng He.

Figure 1: CuraView System Architecture Overview. The framework comprises three core compo...
Figure 2: Complete Detection Pipeline Architecture of the Hallucination Detection Agent. The...
Figure 3: Evidence Grade Determination Rules. The function first checks whether the sentence...
Figure 4: Three reliability layers for structured hallucination-detection outputs: schema binding;...
Figure 5: Performance comparison of four model configurations on E4 safety-critical metrics. The...
Figure 6: Comparison of CuraView and two literature-style reference baselines on 50 test patients.
Figure 7: Type-wise ceiling-corrected fine-tuning gain vs. training sample coverage (50 test patients,...
Figure 8: Distribution of hallucination types detected in Meditron-7B discharge summaries (25...
Original abstract

Discharge summaries require extracting critical information from lengthy electronic health records (EHRs), a process that is labor-intensive when performed manually. Large language models (LLMs) can improve generation efficiency; however, they are prone to producing faithfulness hallucinations, statements that contradict source records, posing direct risks to patient safety. To address this, we present CuraView, a multi-agent framework for sentence-level detection and evidence-grounded explanation of faithfulness hallucinations in discharge summaries. CuraView constructs a GraphRAG-based knowledge graph from patient-level EHRs and implements a closed-loop generation-detection pipeline with sentence-level evidence retrieval and classification spanning four evidence grades from strong support to direct contradiction (E1-E4), yielding structured and interpretable evidence chains. We evaluate CuraView on a subset of 250 patients from the Discharge-Me benchmark, with 50 patients held out for testing. Our fine-tuned Qwen3-14B detection model achieves an F1 of 0.831 on the safety-critical E4 metric (90.9% recall, 76.5% precision) and an F1 of 0.823 on E3+E4, representing a 50.0% relative improvement over the base model and outperforming RAGTruth-style and QAGS-style baselines. These results demonstrate that evidence-chain-based graph retrieval verification substantially improves the factual reliability of clinical documentation, while simultaneously producing reusable annotated datasets for downstream model training and distillation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces CuraView, a multi-agent framework for sentence-level detection of faithfulness hallucinations in LLM-generated discharge summaries. It constructs a GraphRAG-based knowledge graph from patient EHRs, implements a closed-loop retrieval and classification pipeline that assigns one of four evidence grades (E1 strong support through E4 direct contradiction), and produces interpretable evidence chains. On a 250-patient subset of the Discharge-Me benchmark (50 held out for testing), a fine-tuned Qwen3-14B model reports F1 0.831 on the safety-critical E4 metric (90.9% recall, 76.5% precision) and F1 0.823 on E3+E4, a 50% relative improvement over the base model that also outperforms RAGTruth-style and QAGS-style baselines. The work additionally yields reusable annotated datasets.
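As a sanity check, the headline E4 numbers are internally consistent: F1 is the harmonic mean of precision and recall, and the stated 76.5% precision and 90.9% recall do reproduce the reported 0.831.

```python
# F1 = 2PR / (P + R), computed from the reported E4 precision and recall.
precision, recall = 0.765, 0.909
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.831, matching the reported E4 F1
```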

Significance. If the central claims hold, the work has clear significance for clinical NLP and patient safety. It moves beyond black-box hallucination detection by supplying structured, evidence-grounded explanations via GraphRAG verification, which is a meaningful advance over standard RAG baselines in a high-stakes domain. The production of reusable annotated datasets is a concrete secondary contribution that could support future distillation or training. The reported lift is large enough to warrant attention, but the small held-out set and unvalidated KG completeness limit immediate translational impact.

major comments (3)
  1. [Abstract / Experiments] Abstract and Experiments section: the headline F1 of 0.831 on E4 (and the 50% relative improvement) is presented without any description of how the ground-truth E1-E4 labels were produced for the 50-patient test set, inter-annotator agreement, or whether annotation was performed independently of the GraphRAG pipeline. This is load-bearing for interpreting the metric and the claimed outperformance.
  2. [Methodology] Methodology section: the closed-loop verification rests on the assumption that the GraphRAG knowledge graph extracted from EHRs supplies complete, accurate evidence for every sentence, including temporal scope, negations, and low-frequency lab values. No quantitative coverage or fidelity metrics (e.g., recall of key facts, error rate on negation/temporality) are reported, directly undermining the reliability of E4 grading and the performance lift.
  3. [Experiments] Experiments section: the test set comprises only 50 patients. No statistical significance tests, bootstrapped confidence intervals, or per-patient variance are provided for the F1 scores or relative gains, making it impossible to determine whether the reported improvements over baselines are robust given the high variability of clinical text.
minor comments (2)
  1. [Abstract] Abstract: the four evidence grades E1-E4 are referenced but not briefly exemplified; adding one short illustrative sentence per grade would improve self-contained readability.
  2. [References] References: full citations for the original GraphRAG work and the Discharge-Me benchmark should be verified as present and correctly formatted.
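Major comment 3 needs no new data collection: a per-patient bootstrap over the existing 50 test patients would suffice. A minimal sketch, with synthetic per-patient counts standing in for the real sentence-level tallies:

```python
import random

def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 from pooled sentence-level true/false positive and false negative counts."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def bootstrap_f1_ci(patients, n_boot=2000, alpha=0.05, seed=0):
    """patients: list of (tp, fp, fn) tuples, one per patient.

    Resamples patients with replacement, recomputes pooled F1 each time,
    and returns an empirical (1 - alpha) confidence interval.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_boot):
        sample = [patients[rng.randrange(len(patients))] for _ in patients]
        tp = sum(p[0] for p in sample)
        fp = sum(p[1] for p in sample)
        fn = sum(p[2] for p in sample)
        scores.append(f1_from_counts(tp, fp, fn))
    scores.sort()
    lo = scores[int(alpha / 2 * n_boot)]
    hi = scores[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# 50 synthetic patients; real counts would come from the held-out split.
rng = random.Random(42)
patients = [(rng.randint(1, 4), rng.randint(0, 2), rng.randint(0, 1))
            for _ in range(50)]
low, high = bootstrap_f1_ci(patients)
print(f"95% CI for E4 F1: [{low:.3f}, {high:.3f}]")
```

Resampling at the patient level (rather than the sentence level) respects the clustering of sentences within patients, which matters when patients vary widely in summary length.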

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough and constructive review of our paper on CuraView. Their comments have identified important areas for clarification and improvement, particularly regarding the transparency of our evaluation and the robustness of our claims. We have carefully revised the manuscript to address each point and believe the changes significantly strengthen the work. Below we provide point-by-point responses.

Point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and Experiments section: the headline F1 of 0.831 on E4 (and the 50% relative improvement) is presented without any description of how the ground-truth E1-E4 labels were produced for the 50-patient test set, inter-annotator agreement, or whether annotation was performed independently of the GraphRAG pipeline. This is load-bearing for interpreting the metric and the claimed outperformance.

    Authors: We agree with the referee that the process for generating the ground-truth E1-E4 labels is crucial for the validity of our results. This information was inadvertently omitted from the initial submission. In the revised manuscript, we have added a comprehensive description in the Experiments section detailing how the labels were produced: two domain experts annotated the sentences independently, with inter-annotator agreement measured and reported, and the annotation was performed without reference to the GraphRAG outputs. We also clarify that the annotated dataset is made available for reproducibility. These changes directly address the concern. revision: yes

  2. Referee: [Methodology] Methodology section: the closed-loop verification rests on the assumption that the GraphRAG knowledge graph extracted from EHRs supplies complete, accurate evidence for every sentence, including temporal scope, negations, and low-frequency lab values. No quantitative coverage or fidelity metrics (e.g., recall of key facts, error rate on negation/temporality) are reported, directly undermining the reliability of E4 grading and the performance lift.

    Authors: We acknowledge the importance of validating the completeness of the GraphRAG knowledge graph. The original manuscript did not include quantitative coverage metrics, which we agree is a gap. In the revision, we have added a new analysis in the Methodology section that quantifies the KG's fidelity, including recall rates for key facts and specific error rates for handling negations and temporal information. We discuss the implications for E4 grading and how the multi-agent framework provides robustness against potential incompleteness. This addition supports the reliability of our approach. revision: yes

  3. Referee: [Experiments] Experiments section: the test set comprises only 50 patients. No statistical significance tests, bootstrapped confidence intervals, or per-patient variance are provided for the F1 scores or relative gains, making it impossible to determine whether the reported improvements over baselines are robust given the high variability of clinical text.

    Authors: We recognize that the 50-patient test set is relatively small and that the lack of statistical tests limits the assessment of robustness. We have revised the Experiments section to include bootstrapped confidence intervals for all reported F1 scores and relative improvements, as well as per-patient variance analysis. Statistical significance tests have been added to confirm the improvements over baselines. We have also expanded the discussion of limitations to note the sample size and its implications for generalizability. These revisions should allow readers to better evaluate the stability of the results. revision: yes

Circularity Check

0 steps flagged

No significant circularity: standard supervised evaluation on framework-generated annotations

Full rationale

The paper constructs a GraphRAG knowledge graph from EHRs to produce sentence-level evidence grades E1-E4 for hallucination detection, then uses the resulting annotations to fine-tune Qwen3-14B and reports its F1 on a 50-patient held-out split of the 250-patient Discharge-Me subset. This is a conventional train/test split for supervised learning where the test metric measures empirical agreement between the fine-tuned model and the held-out annotations; it does not reduce to the inputs by construction, nor does any equation or claim equate the reported F1 to a fitted parameter or self-generated label by definition. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The performance lift over the base model and external baselines (RAGTruth, QAGS) is presented as an observed outcome on the same annotation set rather than a tautological result. The central derivation chain (GraphRAG retrieval → evidence grading → fine-tuning → held-out F1) remains self-contained against the external benchmark and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The framework introduces a new pipeline but relies on standard assumptions in clinical NLP about data completeness and benchmark validity. No new physical entities are postulated. The evidence grading scheme (E1-E4) appears defined for this work.

free parameters (1)
  • Evidence grade definitions (E1-E4)
    Four evidence grades from strong support to direct contradiction are introduced without reference to prior standardized scales.
axioms (2)
  • domain assumption Electronic health records contain all relevant information needed to verify statements in discharge summaries.
    The verification relies on retrieving from the EHR-derived graph.
  • domain assumption The Discharge-Me benchmark is a valid proxy for real clinical hallucination detection tasks.
    Evaluation is performed on a subset of this benchmark.

pith-pipeline@v0.9.0 · 5575 in / 1745 out tokens · 98643 ms · 2026-05-07T04:01:01.565672+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

54 extracted references · 44 canonical work pages · 5 internal anchors

  1. [1] Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. Large language models in medicine. Nature Medicine, 29(8):1930–1940, 2023. doi: 10.1038/s41591-023-02448-8

  2. [2] Peter Lee, Sebastien Bubeck, and Joseph Petro. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. New England Journal of Medicine, 388(13):1233–1239, 2023. doi: 10.1056/NEJMsr2214184

  3. [3] Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023. doi: 10.1038/s41586-023-06291-2

  4. [4] Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, et al. Towards expert-level medical question answering with large language models. arXiv preprint arXiv:2305.09617, 2023. URL https://arxiv.org/abs/2305.09617

  5. [5] Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. Capabilities of GPT-4 on medical challenge problems. arXiv preprint arXiv:2303.13375, 2023. URL https://arxiv.org/abs/2303.13375

  6. [6] Christopher Y. K. Williams, Charumathi Raghu Subramanian, Syed Salman Ali, et al. Physician- and large language model-generated hospital discharge summaries. JAMA Internal Medicine, 185(7):818–825, 2025. URL https://jamanetwork.com/journals/jamainternalmedicine/fullarticle/2833228

  7. [7] Elham Asgari, Nina Montaña-Brown, Magda Dubois, Saleh Khalil, Jasmine Balloch, Joshua Au Yeung, and Dominic Pimenta. A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation. NPJ Digital Medicine, 8:274, 2025. doi: 10.1038/s41746-025-01670-7

  8. [8] Hussam Alkaissi and Samy I. McFarlane. Artificial hallucinations in ChatGPT: Implications in scientific writing. Cureus, 15(2):e35179, 2023. doi: 10.7759/cureus.35179

  9. [9] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023. doi: 10.1145/3571730

  10. [10] Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. Med-HALT: Medical domain hallucination test for large language models. In Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL), pages 314–334, Singapore, 2023. Association for Computational Linguistics. URL https://aclanthology.org/2023.conll-1.21/

  11. [11] Yaara Artsi, Eyal Klang, Jeremy D. Collins, Benjamin S. Glicksberg, Girish N. Nadkarni, Panagiotis Korfiatis, and Vera Sorin. Large language models in radiology reporting: A systematic review of performance, limitations, and clinical implications. Intelligence-Based Medicine, 12:100287, 2025. doi: 10.1016/j.ibmed.2025.100287

  12. [12] Serena Zhang, Sraavya Sambara, Oishi Banerjee, Julian Acosta, L. John Fahrner, and Pranav Rajpurkar. RadFlag: A black-box hallucination detection method for medical vision language models. arXiv preprint arXiv:2411.00299, 2024. URL https://arxiv.org/abs/2411.00299

  13. [13] Maya Rotmensch, Yoni Halpern, Abdulhakim Tlimat, Steven Horng, and David Sontag. Learning a health knowledge graph from electronic medical records. Scientific Reports, 7(1):5994, 2017. doi: 10.1038/s41598-017-05778-z

  14. [14] doi: 10.1038/s41598-017-05778-z (continuation of [13], split during extraction)

  15. [15] Samuel G. Finlayson, Paea LePendu, and Nigam H. Shah. Building the graph of medicine from millions of clinical narratives. Scientific Data, 1:140032, 2014. doi: 10.1038/sdata.2014.32

  16. [16] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, and Jonathan Larson. From local to global: A graph RAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130, 2024. URL https://arxiv.org/abs/2404.16130

  17. [17] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459–9474, 2020.

  18. [18] URL https://arxiv.org/abs/2005.11401 (continuation of [17], split during extraction)

  19. [19] Xiangru Tang, Anni Zou, Zhuosheng Zhang, Ziming Li, Yilun Zhao, Xingyao Zhang, Arman Cohan, and Mark Gerstein. MedAgents: Large language models as collaborators for zero-shot medical reasoning. arXiv preprint arXiv:2311.10537, 2023. URL https://arxiv.org/abs/2311.10537

  20. [20] Harrison Chase. LangChain: Building applications with LLMs through composability, 2023. URL https://github.com/langchain-ai/langchain

  21. [21] Yu Liu, Duantengchuan Li, Kaili Wang, Zhuoran Xiong, Fobo Shi, Jian Wang, Bing Li, and Bo Hang. Are LLMs good at structured outputs? A benchmark for evaluating structured output capabilities in LLMs. Information Processing & Management, 61(5):103809, 2024. doi: 10.1016/j.ipm.2024.103809

  22. [22] Yaxi Lu, Haolun Li, Xin Cong, Zhong Zhang, Yesai Wu, Yankai Lin, Zhiyuan Liu, Fangming Liu, and Maosong Sun. Learning to generate structured output with schema reinforcement learning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4905–4918, 2025. doi: 10.18653/v1/2025.acl-long.243

  23. [23] Saibo Geng, Martin Josifoski, Maxime Peyrard, and Robert West. Grammar-constrained decoding for structured NLP tasks without finetuning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10932–10952, 2023. doi: 10.18653/v1/2023.emnlp-main.674

  24. [24] Justin Xu, Zhihong Chen, Andrew Johnston, Louis Blankemeier, Maya Varma, Jason Hom, William J. Collins, Ankit Modi, Robert Lloyd, Benjamin Hopkins, Curtis Langlotz, and Jean-Benoit Delbrouck. Overview of the first shared task on clinical text generation: RRG24 and "Discharge Me!". In Proceedings of the 23rd Workshop on Biomedical Natural Language Processi...

  25. [25] Stanford AIMI. Discharge Me! Stanford Center for Artificial Intelligence in Medicine and Imaging, 2024. URL https://stanford-aimi.github.io/discharge-me/

  26. [26] Augustin Toma, Patrick R. Lawler, Jimmy Ba, Rahul G. Krishnan, Barry B. Rubin, and Bo Wang. Clinical Camel: An open expert-level medical language model with dialogue-based knowledge encoding. arXiv preprint arXiv:2305.12031, 2023. URL https://arxiv.org/abs/2305.12031

  27. [27] Chaoyi Wu, Xiaoman Zhang, Xin Zhang, Ya Wang, and Weidi Xie. PMC-LLaMA: Towards building open-source language models for medicine. arXiv preprint arXiv:2304.14454, 2023. URL https://arxiv.org/abs/2304.14454

  28. [28] Haotian Wu, Paul Boulenger, Antonin Faure, Berta Céspedes, Farouk Boukil, Nastasia Morel, Zeming Chen, and Antoine Bosselut. EPFL-MAKE at "Discharge Me!": An LLM system for automatically generating discharge summaries of clinical electronic health record. In Proceedings of the 23rd Workshop on Biomedical Natural Language Processing, pages 696–711, Bangko...

  29. [30] URL https://arxiv.org/abs/2407.15359

  30. [31] Luyu Gao, Zhuyun Dai, Panupong Pasupat, Anthony Chen, Arun Tejasvi Chaganty, Yicheng Fan, Vincent Y. Zhao, Ni Lao, Hongrae Lee, Da-Cheng Juan, and Kelvin Guu. RARR: Researching and revising what language models say, using language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pag...

  31. [32] Guangzhi Xiong, Qiao Jin, Zhiyong Lu, and Aidong Zhang. Benchmarking retrieval-augmented generation for medicine. arXiv preprint arXiv:2402.13178, 2024. URL https://arxiv.org/abs/2402.13178

  32. [33] Chengrui Wang, Qingqing Long, Meng Xiao, Xunxin Cai, Chengjun Wu, Zhen Meng, Xuezhi Wang, and Yuanchun Zhou. BioRAG: A RAG-LLM framework for biological question reasoning. arXiv preprint arXiv:2408.01107, 2024. URL https://arxiv.org/abs/2408.01107

  33. [34] Junde Wu, Jiayuan Zhu, Yunli Qi, Jingkun Chen, Min Xu, Filippo Menolascina, Yueming Jin, and Vicente Grau. Medical graph RAG: Evidence-based medical large language model via graph retrieval-augmented generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 28443–28467, 2025. doi: ...

  34. [35] Lameck Mbangula Amugongo, Pietro Mascheroni, Steven Brooks, Stefan Doering, and Jan Seidel. Retrieval augmented generation for large language models in healthcare: A systematic review. PLOS Digital Health, 4(6):e0000877, 2025. doi: 10.1371/journal.pdig.0000877

  35. [36] Olivier Bodenreider. The unified medical language system (UMLS): Integrating biomedical terminology. Nucleic Acids Research, 32(suppl_1):D267–D270, 2004. doi: 10.1093/nar/gkh061

  36. [37] Kevin Donnelly. SNOMED-CT: The advanced terminology and coding system for eHealth. Studies in Health Technology and Informatics, 121:279–290, 2006. URL https://pubmed.ncbi.nlm.nih.gov/17095826/

  37. [38] World Health Organization. International statistical classification of diseases and related health problems, 10th revision, 2016. URL https://icd.who.int/browse10/2016/en

  38. [39] Eunsuk Chang and Sumi Sung. Use of SNOMED CT in large language models: Scoping review. JMIR Medical Informatics, 12:e62924, 2024. doi: 10.2196/62924

  39. [40] Significant Gravitas. AutoGPT: An experimental open-source attempt to make GPT-4 fully autonomous, 2023. URL https://github.com/Significant-Gravitas/AutoGPT

  40. [41] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2023. URL https://arxiv.org/abs/2210.03629

  41. [42] Weiwen Liu, Xu Huang, Xingshan Zeng, Xinlong Hao, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Zhengying Liu, Yuanqing Yu, et al. ToolACE: Winning the points of LLM function calling. arXiv preprint arXiv:2409.00920, 2025. URL https://arxiv.org/abs/2409.00920

  42. [43] Akshara Prabhakar, Zuxin Liu, Weiran Yao, Jianguo Zhang, Ming Zhu, Shiyu Wang, Zhiwei Liu, Tulika Awalgaonkar, Haolin Chen, Thai Hoang, et al. APIGen-MT: Agentic pipeline for multi-turn data generation via simulated agent-human interplay. arXiv preprint arXiv:2504.03601, 2025. URL https://arxiv.org/abs/2504.03601

  43. [44] Shaokun Zhang, Yi Dong, Jieyu Zhang, Jan Kautz, Bryan Catanzaro, Andrew Tao, Qingyun Wu, Zhiding Yu, and Guilin Liu. Nemotron-Research-Tool-N1: Exploring tool-using language models with reinforced reasoning. arXiv preprint arXiv:2505.00024, 2025. URL https://arxiv.org/abs/2505.00024

  44. [45] Emily Croxford et al. Evaluating clinical AI summaries with large language models as judges. NPJ Digital Medicine, 8(1):640, 2025. doi: 10.1038/s41746-025-02005-2

  45. [46] Selvan Sunitha Ravi, Bartosz Mielczarek, Anand Kannappan, Douwe Kiela, and Rebecca Qian. Lynx: An open source hallucination evaluation model. arXiv preprint arXiv:2407.08488, 2024. URL https://arxiv.org/abs/2407.08488

  46. [47] Markus Frohmann, Igor Sterner, Ivan Vulić, Benjamin Minixhofer, and Markus Schedl. Segment any text: A universal approach for robust, efficient and adaptable sentence segmentation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 11908–11941, Miami, Florida, USA, 2024. Association for Computational Lingu...

  47. [48] LangChain. Structured output. LangChain Documentation, 2024. URL https://docs.langchain.com/oss/python/langchain/structured-output

  48. [49] Zhouyu Jiang, Mengshu Sun, Zhiqiang Zhang, and Lei Liang. Bi'an: A bilingual benchmark and model for hallucination detection in retrieval-augmented generation. arXiv preprint arXiv:2502.19209, 2025. URL https://arxiv.org/abs/2502.19209

  49. [50] Siya Qi, Rui Cao, Yulan He, and Zheng Yuan. Evaluating LLMs' assessment of mixed-context hallucination through the lens of summarization. In Findings of the Association for Computational Linguistics: ACL 2025, pages 16480–16503, Vienna, Austria, 2025. doi: 10.18653/v1/2025.findings-acl.847

  50. [51] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training. arXiv preprint arXiv:1710.03740, 2018. URL https://arxiv.org/abs/1710.03740

  51. [52] Kento Fujii, Satoshi Tamaki, Yusuke Kuroda, Ryota Shingaki, and Yutaka Kanemasa. Accelerating large language model training with 4D parallelism and memory consumption estimator. arXiv preprint arXiv:2411.06465, 2024. URL https://arxiv.org/abs/2411.06465

  52. [53] Junyeob Kim, Junho Kim, Hwanju Kim, and Jangwoo Kim. LLMem: Estimating GPU memory usage for fine-tuning pre-trained LLMs. arXiv preprint arXiv:2404.10933, 2024. URL https://arxiv.org/abs/2404.10933

  53. [54] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025. URL https://arxiv.org/abs/2505.09388

  54. [55] Zeming Chen, Alejandro Hernández-Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Matteo Pagliardini, et al. MEDITRON-70B: Scaling medical pretraining for large language models. arXiv preprint arXiv:2311.16079, 2023. URL https://arxiv.org/abs/2311.16079