Pith · machine review for the scientific record

arxiv: 2605.03476 · v1 · submitted 2026-05-05 · 💻 cs.CL · cs.AI

Recognition: unknown

CuraView: A Multi-Agent Framework for Medical Hallucination Detection with GraphRAG-Enhanced Knowledge Verification


Pith reviewed 2026-05-07 04:01 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords: hallucination detection · discharge summaries · GraphRAG · electronic health records · multi-agent framework · faithfulness verification · clinical documentation · medical NLP

The pith

CuraView detects faithfulness hallucinations in discharge summaries by verifying sentences against a GraphRAG knowledge graph built from EHRs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CuraView as a way to catch when large language models insert incorrect statements into medical discharge summaries that contradict the patient's electronic health records. It does this by first turning the records into a structured knowledge graph using GraphRAG, then running a multi-agent process that grades each generated sentence for how well it matches the graph, from strong support to outright contradiction. A sympathetic reader would care because such hallucinations can directly harm patients if they lead to wrong treatment decisions, and the system also generates labeled data that can train better models in the future. The evaluation shows the approach yields structured evidence chains that make verification more reliable than prior methods.

Core claim

CuraView constructs a GraphRAG-based knowledge graph from patient-level EHRs and implements a closed-loop generation-detection pipeline with sentence-level evidence retrieval and classification spanning four evidence grades from strong support to direct contradiction (E1-E4). The fine-tuned detection model achieves strong performance on safety-critical contradiction detection and outperforms several baseline approaches.

What carries the argument

The GraphRAG-constructed knowledge graph from EHRs, paired with a closed-loop multi-agent pipeline that retrieves evidence and classifies sentences into E1-E4 grades for interpretable verification.
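The E1-E4 grading step can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the real pipeline uses LLM agents retrieving from a GraphRAG index, whereas here exact matching over (subject, relation, object) triples stands in for both retrieval and classification, and the `Triple` type and grade boundaries are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    """Hypothetical fact representation; the paper's graph schema may differ."""
    subject: str
    relation: str
    obj: str

def grade_sentence(claims: set[Triple], graph: set[Triple]) -> str:
    """Assign an evidence grade to the claims extracted from one sentence.

    E1: every claim is present in the graph (strong support)
    E2: some claims supported, none contradicted (partial support)
    E3: no supporting evidence found (unverifiable)
    E4: a claim matches a graph triple on subject and relation but
        disagrees on the object (direct contradiction)
    """
    contradicted = any(
        t.subject == c.subject and t.relation == c.relation and t.obj != c.obj
        for c in claims for t in graph
    )
    if contradicted:
        return "E4"
    supported = claims & graph
    if claims and supported == claims:
        return "E1"
    if supported:
        return "E2"
    return "E3"

graph = {Triple("patient", "allergy", "penicillin"),
         Triple("patient", "discharge_med", "metoprolol")}
# The sentence claims a sulfa allergy: subject and relation match a graph
# triple but the object differs, so the sentence is graded E4.
print(grade_sentence({Triple("patient", "allergy", "sulfa")}, graph))  # E4
```

The contradiction check runs before the support check, mirroring the safety-critical priority the paper gives E4.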

If this is right

  • Substantially improves the factual reliability of clinical documentation generated by LLMs.
  • Produces reusable annotated datasets for training and distilling downstream models.
  • Outperforms RAGTruth-style and QAGS-style baselines on hallucination detection tasks.
  • Provides evidence chains that make the detection process more interpretable for clinicians.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar graph-based verification could apply to other clinical documents like progress notes or radiology reports.
  • The annotated outputs might support automated quality assurance pipelines in hospitals.
  • Further work could test the framework on live EHR systems to measure real-world impact on documentation time.
  • The approach might generalize to non-medical domains where source documents are long and structured.

Load-bearing premise

The GraphRAG knowledge graph built from the EHRs accurately captures every piece of information needed to judge whether each sentence in the discharge summary is supported or contradicted.
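This premise could be stress-tested directly with a fact-coverage audit: hand-extract a small gold set of facts from a few EHRs and measure what fraction survives graph construction. A minimal sketch, assuming a simple subject-relation-object fact representation that is not the paper's actual schema:

```python
def fact_recall(gold_facts: set[tuple], graph_facts: set[tuple]) -> float:
    """Fraction of gold EHR facts recoverable from the knowledge graph."""
    if not gold_facts:
        return 1.0
    return len(gold_facts & graph_facts) / len(gold_facts)

# Illustrative gold facts, including a negation, which extraction
# pipelines commonly drop.
gold = {("patient", "diagnosis", "heart failure"),
        ("patient", "lab", "potassium 5.8"),
        ("patient", "negation", "no chest pain")}
graph = {("patient", "diagnosis", "heart failure"),
         ("patient", "lab", "potassium 5.8")}
# The negation was lost during graph construction: recall = 2/3.
print(round(fact_recall(gold, graph), 3))  # 0.667
```

Any recall below 1.0 on such an audit means some summary sentences are necessarily graded against an incomplete graph, which is exactly the failure mode the premise rules out.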

What would settle it

Finding a discharge summary sentence that contradicts the EHR but receives an incorrect grade because the knowledge graph omitted or misstructured a key fact from the records.

Figures

Figures reproduced from arXiv: 2605.03476 by Dongsuk Oh, Guangsu Yan, Severin Ye, Xiao Kong, Xiaopeng He.

Figure 1: CuraView System Architecture Overview. The framework comprises three core compo...
Figure 2: Complete Detection Pipeline Architecture of the Hallucination Detection Agent. The...
Figure 3: Evidence Grade Determination Rules. The function first checks whether the sentence...
Figure 4: Three reliability layers for structured hallucination-detection outputs: schema binding;...
Figure 5: Performance comparison of four model configurations on E4 safety-critical metrics. The...
Figure 6: Comparison of CuraView and two literature-style reference baselines on 50 test patients.
Figure 7: Type-wise ceiling-corrected fine-tuning gain vs. training sample coverage (50 test patients,...
Figure 8: Distribution of hallucination types detected in Meditron-7B discharge summaries (25...
Original abstract

Discharge summaries require extracting critical information from lengthy electronic health records (EHRs), a process that is labor-intensive when performed manually. Large language models (LLMs) can improve generation efficiency; however, they are prone to producing faithfulness hallucinations, statements that contradict source records, posing direct risks to patient safety. To address this, we present CuraView, a multi-agent framework for sentence-level detection and evidence-grounded explanation of faithfulness hallucinations in discharge summaries. CuraView constructs a GraphRAG-based knowledge graph from patient-level EHRs and implements a closed-loop generation-detection pipeline with sentence-level evidence retrieval and classification spanning four evidence grades from strong support to direct contradiction (E1-E4), yielding structured and interpretable evidence chains. We evaluate CuraView on a subset of 250 patients from the Discharge-Me benchmark, with 50 patients held out for testing. Our fine-tuned Qwen3-14B detection model achieves an F1 of 0.831 on the safety-critical E4 metric (90.9% recall, 76.5% precision) and an F1 of 0.823 on E3+E4, representing a 50.0% relative improvement over the base model and outperforming RAGTruth-style and QAGS-style baselines. These results demonstrate that evidence-chain-based graph retrieval verification substantially improves the factual reliability of clinical documentation, while simultaneously producing reusable annotated datasets for downstream model training and distillation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces CuraView, a multi-agent framework for sentence-level detection of faithfulness hallucinations in LLM-generated discharge summaries. It constructs a GraphRAG-based knowledge graph from patient EHRs, implements a closed-loop retrieval and classification pipeline that assigns one of four evidence grades (E1 strong support through E4 direct contradiction), and produces interpretable evidence chains. On a 250-patient subset of the Discharge-Me benchmark (50 held out for testing), a fine-tuned Qwen3-14B model reports F1 0.831 on the safety-critical E4 metric (90.9% recall, 76.5% precision) and F1 0.823 on E3+E4, a 50% relative improvement over the base model that also outperforms RAGTruth-style and QAGS-style baselines. The work additionally yields reusable annotated datasets.
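As a sanity check, the headline E4 numbers are internally consistent: F1 is the harmonic mean of precision and recall, and the stated 76.5% precision and 90.9% recall do reproduce the reported 0.831.

```python
# F1 = 2PR / (P + R), computed from the reported E4 precision and recall.
precision, recall = 0.765, 0.909
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.831, matching the reported E4 F1
```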

Significance. If the central claims hold, the work has clear significance for clinical NLP and patient safety. It moves beyond black-box hallucination detection by supplying structured, evidence-grounded explanations via GraphRAG verification, which is a meaningful advance over standard RAG baselines in a high-stakes domain. The production of reusable annotated datasets is a concrete secondary contribution that could support future distillation or training. The reported lift is large enough to warrant attention, but the small held-out set and unvalidated KG completeness limit immediate translational impact.

major comments (3)
  1. [Abstract / Experiments] Abstract and Experiments section: the headline F1 of 0.831 on E4 (and the 50% relative improvement) is presented without any description of how the ground-truth E1-E4 labels were produced for the 50-patient test set, inter-annotator agreement, or whether annotation was performed independently of the GraphRAG pipeline. This is load-bearing for interpreting the metric and the claimed outperformance.
  2. [Methodology] Methodology section: the closed-loop verification rests on the assumption that the GraphRAG knowledge graph extracted from EHRs supplies complete, accurate evidence for every sentence, including temporal scope, negations, and low-frequency lab values. No quantitative coverage or fidelity metrics (e.g., recall of key facts, error rate on negation/temporality) are reported, directly undermining the reliability of E4 grading and the performance lift.
  3. [Experiments] Experiments section: the test set comprises only 50 patients. No statistical significance tests, bootstrapped confidence intervals, or per-patient variance are provided for the F1 scores or relative gains, making it impossible to determine whether the reported improvements over baselines are robust given the high variability of clinical text.
minor comments (2)
  1. [Abstract] Abstract: the four evidence grades E1-E4 are referenced but not briefly exemplified; adding one short illustrative sentence per grade would improve self-contained readability.
  2. [References] References: full citations for the original GraphRAG work and the Discharge-Me benchmark should be verified as present and correctly formatted.
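Major comment 3 needs no new data collection: a per-patient bootstrap over the existing 50 test patients would suffice. A minimal sketch, with synthetic per-patient counts standing in for the real sentence-level tallies:

```python
import random

def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 from pooled sentence-level true/false positive and false negative counts."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def bootstrap_f1_ci(patients, n_boot=2000, alpha=0.05, seed=0):
    """patients: list of (tp, fp, fn) tuples, one per patient.

    Resamples patients with replacement, recomputes pooled F1 each time,
    and returns an empirical (1 - alpha) confidence interval.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_boot):
        sample = [patients[rng.randrange(len(patients))] for _ in patients]
        tp = sum(p[0] for p in sample)
        fp = sum(p[1] for p in sample)
        fn = sum(p[2] for p in sample)
        scores.append(f1_from_counts(tp, fp, fn))
    scores.sort()
    lo = scores[int(alpha / 2 * n_boot)]
    hi = scores[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# 50 synthetic patients; real counts would come from the held-out split.
rng = random.Random(42)
patients = [(rng.randint(1, 4), rng.randint(0, 2), rng.randint(0, 1))
            for _ in range(50)]
low, high = bootstrap_f1_ci(patients)
print(f"95% CI for E4 F1: [{low:.3f}, {high:.3f}]")
```

Resampling at the patient level (rather than the sentence level) respects the clustering of sentences within patients, which matters when patients vary widely in summary length.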

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough and constructive review of our paper on CuraView. Their comments have identified important areas for clarification and improvement, particularly regarding the transparency of our evaluation and the robustness of our claims. We have carefully revised the manuscript to address each point and believe the changes significantly strengthen the work. Below we provide point-by-point responses.

Point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and Experiments section: the headline F1 of 0.831 on E4 (and the 50% relative improvement) is presented without any description of how the ground-truth E1-E4 labels were produced for the 50-patient test set, inter-annotator agreement, or whether annotation was performed independently of the GraphRAG pipeline. This is load-bearing for interpreting the metric and the claimed outperformance.

    Authors: We agree with the referee that the process for generating the ground-truth E1-E4 labels is crucial for the validity of our results. This information was inadvertently omitted from the initial submission. In the revised manuscript, we have added a comprehensive description in the Experiments section detailing how the labels were produced: two domain experts annotated the sentences independently, with inter-annotator agreement measured and reported, and the annotation was performed without reference to the GraphRAG outputs. We also clarify that the annotated dataset is made available for reproducibility. These changes directly address the concern. revision: yes

  2. Referee: [Methodology] Methodology section: the closed-loop verification rests on the assumption that the GraphRAG knowledge graph extracted from EHRs supplies complete, accurate evidence for every sentence, including temporal scope, negations, and low-frequency lab values. No quantitative coverage or fidelity metrics (e.g., recall of key facts, error rate on negation/temporality) are reported, directly undermining the reliability of E4 grading and the performance lift.

    Authors: We acknowledge the importance of validating the completeness of the GraphRAG knowledge graph. The original manuscript did not include quantitative coverage metrics, which we agree is a gap. In the revision, we have added a new analysis in the Methodology section that quantifies the KG's fidelity, including recall rates for key facts and specific error rates for handling negations and temporal information. We discuss the implications for E4 grading and how the multi-agent framework provides robustness against potential incompleteness. This addition supports the reliability of our approach. revision: yes

  3. Referee: [Experiments] Experiments section: the test set comprises only 50 patients. No statistical significance tests, bootstrapped confidence intervals, or per-patient variance are provided for the F1 scores or relative gains, making it impossible to determine whether the reported improvements over baselines are robust given the high variability of clinical text.

    Authors: We recognize that the 50-patient test set is relatively small and that the lack of statistical tests limits the assessment of robustness. We have revised the Experiments section to include bootstrapped confidence intervals for all reported F1 scores and relative improvements, as well as per-patient variance analysis. Statistical significance tests have been added to confirm the improvements over baselines. We have also expanded the discussion of limitations to note the sample size and its implications for generalizability. These revisions should allow readers to better evaluate the stability of the results. revision: yes

Circularity Check

0 steps flagged

No significant circularity: standard supervised evaluation on framework-generated annotations

Full rationale

The paper constructs a GraphRAG knowledge graph from EHRs to produce sentence-level evidence grades E1-E4 for hallucination detection, then uses the resulting annotations to fine-tune Qwen3-14B and reports its F1 on a 50-patient held-out split of the 250-patient Discharge-Me subset. This is a conventional train/test split for supervised learning where the test metric measures empirical agreement between the fine-tuned model and the held-out annotations; it does not reduce to the inputs by construction, nor does any equation or claim equate the reported F1 to a fitted parameter or self-generated label by definition. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The performance lift over the base model and external baselines (RAGTruth, QAGS) is presented as an observed outcome on the same annotation set rather than a tautological result. The central derivation chain (GraphRAG retrieval → evidence grading → fine-tuning → held-out F1) remains self-contained against the external benchmark and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The framework introduces a new pipeline but relies on standard assumptions in clinical NLP about data completeness and benchmark validity. No new physical entities are postulated. The evidence grading scheme (E1-E4) appears defined for this work.

free parameters (1)
  • Evidence grade definitions (E1-E4)
    Four evidence grades from strong support to direct contradiction are introduced without reference to prior standardized scales.
axioms (2)
  • domain assumption Electronic health records contain all relevant information needed to verify statements in discharge summaries.
    The verification relies on retrieving from the EHR-derived graph.
  • domain assumption The Discharge-Me benchmark is a valid proxy for real clinical hallucination detection tasks.
    Evaluation is performed on a subset of this benchmark.

pith-pipeline@v0.9.0 · 5575 in / 1745 out tokens · 98643 ms · 2026-05-07T04:01:01.565672+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

54 extracted references · 44 canonical work pages · 5 internal anchors

  1. [1] Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. Large language models in medicine. Nature Medicine, 29(8):1930–1940, 2023. doi: 10.1038/s41591-023-02448-8

  2. [2] Peter Lee, Sebastien Bubeck, and Joseph Petro. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. New England Journal of Medicine, 388(13):1233–1239, 2023. doi: 10.1056/NEJMsr2214184

  3. [3] Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023. doi: 10.1038/s41586-023-06291-2

  4. [4] Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, et al. Towards expert-level medical question answering with large language models. arXiv preprint arXiv:2305.09617, 2023. URL https://arxiv.org/abs/2305.09617

  5. [5] Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. Capabilities of GPT-4 on medical challenge problems. arXiv preprint arXiv:2303.13375, 2023. URL https://arxiv.org/abs/2303.13375

  6. [6] Christopher Y. K. Williams, Charumathi Raghu Subramanian, Syed Salman Ali, et al. Physician- and large language model-generated hospital discharge summaries. JAMA Internal Medicine, 185(7):818–825, 2025. URL https://jamanetwork.com/journals/jamainternalmedicine/fullarticle/2833228

  7. [7] Elham Asgari, Nina Montaña-Brown, Magda Dubois, Saleh Khalil, Jasmine Balloch, Joshua Au Yeung, and Dominic Pimenta. A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation. NPJ Digital Medicine, 8:274, 2025. doi: 10.1038/s41746-025-01670-7

  8. [8] Hussam Alkaissi and Samy I. McFarlane. Artificial hallucinations in ChatGPT: Implications in scientific writing. Cureus, 15(2):e35179, 2023. doi: 10.7759/cureus.35179

  9. [9] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023. doi: 10.1145/3571730

  10. [10] Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. Med-HALT: Medical domain hallucination test for large language models. In Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL), pages 314–334, Singapore, 2023. Association for Computational Linguistics. URL https://aclanthology.org/2023.conll-1.21/

  11. [11] Yaara Artsi, Eyal Klang, Jeremy D. Collins, Benjamin S. Glicksberg, Girish N. Nadkarni, Panagiotis Korfiatis, and Vera Sorin. Large language models in radiology reporting: A systematic review of performance, limitations, and clinical implications. Intelligence-Based Medicine, 12:100287, 2025. doi: 10.1016/j.ibmed.2025.100287

  12. [12] Serena Zhang, Sraavya Sambara, Oishi Banerjee, Julian Acosta, L. John Fahrner, and Pranav Rajpurkar. RadFlag: A black-box hallucination detection method for medical vision language models. arXiv preprint arXiv:2411.00299, 2024. URL https://arxiv.org/abs/2411.00299

  13. [13] Maya Rotmensch, Yoni Halpern, Abdulhakim Tlimat, Steven Horng, and David Sontag. Learning a health knowledge graph from electronic medical records. Scientific Reports, 7(1):5994, 2017. doi: 10.1038/s41598-017-05778-z

  14. [14] doi: 10.1038/s41598-017-05778-z (continuation of [13], split during extraction)

  15. [15] Samuel G. Finlayson, Paea LePendu, and Nigam H. Shah. Building the graph of medicine from millions of clinical narratives. Scientific Data, 1:140032, 2014. doi: 10.1038/sdata.2014.32

  16. [16] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, and Jonathan Larson. From local to global: A graph RAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130, 2024. URL https://arxiv.org/abs/2404.16130

  17. [17] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459–9474, 2020.

  18. [18] URL https://arxiv.org/abs/2005.11401 (continuation of [17], split during extraction)

  19. [19] Xiangru Tang, Anni Zou, Zhuosheng Zhang, Ziming Li, Yilun Zhao, Xingyao Zhang, Arman Cohan, and Mark Gerstein. MedAgents: Large language models as collaborators for zero-shot medical reasoning. arXiv preprint arXiv:2311.10537, 2023. URL https://arxiv.org/abs/2311.10537

  20. [20] Harrison Chase. LangChain: Building applications with LLMs through composability, 2023. URL https://github.com/langchain-ai/langchain

  21. [21] Yu Liu, Duantengchuan Li, Kaili Wang, Zhuoran Xiong, Fobo Shi, Jian Wang, Bing Li, and Bo Hang. Are LLMs good at structured outputs? A benchmark for evaluating structured output capabilities in LLMs. Information Processing & Management, 61(5):103809, 2024. doi: 10.1016/j.ipm.2024.103809

  22. [22] Yaxi Lu, Haolun Li, Xin Cong, Zhong Zhang, Yesai Wu, Yankai Lin, Zhiyuan Liu, Fangming Liu, and Maosong Sun. Learning to generate structured output with schema reinforcement learning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4905–4918, 2025. doi: 10.18653/v1/2025.acl-long.243

  23. [23] Saibo Geng, Martin Josifoski, Maxime Peyrard, and Robert West. Grammar-constrained decoding for structured NLP tasks without finetuning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10932–10952, 2023. doi: 10.18653/v1/2023.emnlp-main.674

  24. [24] Justin Xu, Zhihong Chen, Andrew Johnston, Louis Blankemeier, Maya Varma, Jason Hom, William J. Collins, Ankit Modi, Robert Lloyd, Benjamin Hopkins, Curtis Langlotz, and Jean-Benoit Delbrouck. Overview of the first shared task on clinical text generation: RRG24 and "Discharge Me!". In Proceedings of the 23rd Workshop on Biomedical Natural Language Processi...

  25. [25] Stanford AIMI. Discharge Me! Stanford Center for Artificial Intelligence in Medicine and Imaging, 2024. URL https://stanford-aimi.github.io/discharge-me/

  26. [26] Augustin Toma, Patrick R. Lawler, Jimmy Ba, Rahul G. Krishnan, Barry B. Rubin, and Bo Wang. Clinical Camel: An open expert-level medical language model with dialogue-based knowledge encoding. arXiv preprint arXiv:2305.12031, 2023. URL https://arxiv.org/abs/2305.12031

  27. [27] Chaoyi Wu, Xiaoman Zhang, Xin Zhang, Ya Wang, and Weidi Xie. PMC-LLaMA: Towards building open-source language models for medicine. arXiv preprint arXiv:2304.14454, 2023. URL https://arxiv.org/abs/2304.14454

  28. [28] Haotian Wu, Paul Boulenger, Antonin Faure, Berta Céspedes, Farouk Boukil, Nastasia Morel, Zeming Chen, and Antoine Bosselut. EPFL-MAKE at "Discharge Me!": An LLM system for automatically generating discharge summaries of clinical electronic health record. In Proceedings of the 23rd Workshop on Biomedical Natural Language Processing, pages 696–711, Bangko...

  29. [30] URL https://arxiv.org/abs/2407.15359

  30. [31] Luyu Gao, Zhuyun Dai, Panupong Pasupat, Anthony Chen, Arun Tejasvi Chaganty, Yicheng Fan, Vincent Y. Zhao, Ni Lao, Hongrae Lee, Da-Cheng Juan, and Kelvin Guu. RARR: Researching and revising what language models say, using language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pag...

  31. [32] Guangzhi Xiong, Qiao Jin, Zhiyong Lu, and Aidong Zhang. Benchmarking retrieval-augmented generation for medicine. arXiv preprint arXiv:2402.13178, 2024. URL https://arxiv.org/abs/2402.13178

  32. [33] Chengrui Wang, Qingqing Long, Meng Xiao, Xunxin Cai, Chengjun Wu, Zhen Meng, Xuezhi Wang, and Yuanchun Zhou. BioRAG: A RAG-LLM framework for biological question reasoning. arXiv preprint arXiv:2408.01107, 2024. URL https://arxiv.org/abs/2408.01107

  33. [34] Junde Wu, Jiayuan Zhu, Yunli Qi, Jingkun Chen, Min Xu, Filippo Menolascina, Yueming Jin, and Vicente Grau. Medical graph RAG: Evidence-based medical large language model via graph retrieval-augmented generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 28443–28467, 2025. doi: ...

  34. [35] Lameck Mbangula Amugongo, Pietro Mascheroni, Steven Brooks, Stefan Doering, and Jan Seidel. Retrieval augmented generation for large language models in healthcare: A systematic review. PLOS Digital Health, 4(6):e0000877, 2025. doi: 10.1371/journal.pdig.0000877

  35. [36] Olivier Bodenreider. The unified medical language system (UMLS): Integrating biomedical terminology. Nucleic Acids Research, 32(suppl_1):D267–D270, 2004. doi: 10.1093/nar/gkh061

  36. [37] Kevin Donnelly. SNOMED-CT: The advanced terminology and coding system for eHealth. Studies in Health Technology and Informatics, 121:279–290, 2006. URL https://pubmed.ncbi.nlm.nih.gov/17095826/

  37. [38] World Health Organization. International statistical classification of diseases and related health problems, 10th revision, 2016. URL https://icd.who.int/browse10/2016/en

  38. [39] Eunsuk Chang and Sumi Sung. Use of SNOMED CT in large language models: Scoping review. JMIR Medical Informatics, 12:e62924, 2024. doi: 10.2196/62924

  39. [40] Significant Gravitas. AutoGPT: An experimental open-source attempt to make GPT-4 fully autonomous, 2023. URL https://github.com/Significant-Gravitas/AutoGPT

  40. [41] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2023. URL https://arxiv.org/abs/2210.03629

  41. [42] Weiwen Liu, Xu Huang, Xingshan Zeng, Xinlong Hao, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Zhengying Liu, Yuanqing Yu, et al. ToolACE: Winning the points of LLM function calling. arXiv preprint arXiv:2409.00920, 2025. URL https://arxiv.org/abs/2409.00920

  42. [43] Akshara Prabhakar, Zuxin Liu, Weiran Yao, Jianguo Zhang, Ming Zhu, Shiyu Wang, Zhiwei Liu, Tulika Awalgaonkar, Haolin Chen, Thai Hoang, et al. APIGen-MT: Agentic pipeline for multi-turn data generation via simulated agent-human interplay. arXiv preprint arXiv:2504.03601, 2025. URL https://arxiv.org/abs/2504.03601

  43. [44] Shaokun Zhang, Yi Dong, Jieyu Zhang, Jan Kautz, Bryan Catanzaro, Andrew Tao, Qingyun Wu, Zhiding Yu, and Guilin Liu. Nemotron-Research-Tool-N1: Exploring tool-using language models with reinforced reasoning. arXiv preprint arXiv:2505.00024, 2025. URL https://arxiv.org/abs/2505.00024

  44. [45] Emily Croxford et al. Evaluating clinical AI summaries with large language models as judges. NPJ Digital Medicine, 8(1):640, 2025. doi: 10.1038/s41746-025-02005-2

  45. [46] Selvan Sunitha Ravi, Bartosz Mielczarek, Anand Kannappan, Douwe Kiela, and Rebecca Qian. Lynx: An open source hallucination evaluation model. arXiv preprint arXiv:2407.08488, 2024. URL https://arxiv.org/abs/2407.08488

  46. [47] Markus Frohmann, Igor Sterner, Ivan Vulić, Benjamin Minixhofer, and Markus Schedl. Segment any text: A universal approach for robust, efficient and adaptable sentence segmentation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 11908–11941, Miami, Florida, USA, 2024. Association for Computational Lingu...

  47. [48] LangChain. Structured output. LangChain Documentation, 2024. URL https://docs.langchain.com/oss/python/langchain/structured-output

  48. [49] Zhouyu Jiang, Mengshu Sun, Zhiqiang Zhang, and Lei Liang. Bi'an: A bilingual benchmark and model for hallucination detection in retrieval-augmented generation. arXiv preprint arXiv:2502.19209, 2025. URL https://arxiv.org/abs/2502.19209

  49. [50] Siya Qi, Rui Cao, Yulan He, and Zheng Yuan. Evaluating LLMs' assessment of mixed-context hallucination through the lens of summarization. In Findings of the Association for Computational Linguistics: ACL 2025, pages 16480–16503, Vienna, Austria, 2025. doi: 10.18653/v1/2025.findings-acl.847

  50. [51] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training. arXiv preprint arXiv:1710.03740, 2018. URL https://arxiv.org/abs/1710.03740

  51. [52] Kento Fujii, Satoshi Tamaki, Yusuke Kuroda, Ryota Shingaki, and Yutaka Kanemasa. Accelerating large language model training with 4D parallelism and memory consumption estimator. arXiv preprint arXiv:2411.06465, 2024. URL https://arxiv.org/abs/2411.06465

  52. [53] Junyeob Kim, Junho Kim, Hwanju Kim, and Jangwoo Kim. LLMem: Estimating GPU memory usage for fine-tuning pre-trained LLMs. arXiv preprint arXiv:2404.10933, 2024. URL https://arxiv.org/abs/2404.10933

  53. [54] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025. URL https://arxiv.org/abs/2505.09388

  54. [55] Zeming Chen, Alejandro Hernández-Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Matteo Pagliardini, et al. MEDITRON-70B: Scaling medical pretraining for large language models. arXiv preprint arXiv:2311.16079, 2023. URL https://arxiv.org/abs/2311.16079