pith. sign in

arxiv: 2606.25651 · v2 · pith:NMY6ROC3new · submitted 2026-06-24 · 💻 cs.CL

MedGuards: Multi-Agent System for Reliable Medical Error Detection and Correction

Pith reviewed 2026-06-29 05:02 UTC · model grok-4.3

classification 💻 cs.CL
keywords medical error detectionmulti-agent systemslarge language modelsin-context learningclinical noteserror correctionhealthcare AI safetyKPCS metric
0
0 comments X

The pith

MedGuards treats medical error detection and correction as a multi-agent in-context learning task with specialized agents and confidence-guided arbitration to improve reliability without retraining LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes MedGuards to address the problem of errors in LLM-generated or existing medical text, which can risk patient safety. It treats the task as a multi-agent setup where separate agents detect, localize, and correct errors, and an arbitration mechanism uses confidence scores and reasoning traces to resolve disagreements. This approach is tested on four multilingual datasets of clinical notes and shows improvements over existing methods. A new metric called Keyword-Prioritized Correction Score is introduced for evaluation. The framework aims to make LLM deployment in healthcare safer by enhancing interpretability and adaptability.

Core claim

MedGuards is a framework that treats medical error detection and correction as a multi-agent in-context learning task. Specialized agents separately detect, localize, and correct errors, while a confidence-guided arbitration mechanism resolves disagreements using reasoning traces and confidence scores. This design enhances interpretability, robustness, and adaptability without requiring additional training of the base LLMs, and experiments demonstrate significant improvements across several metrics and models on four multilingual medical datasets.

What carries the argument

The multi-agent in-context learning task with specialized agents for detection, localization, and correction, plus a confidence-guided arbitration mechanism that resolves disagreements using reasoning traces and confidence scores.

If this is right

  • Enhances interpretability of the error handling process through reasoning traces.
  • Improves robustness and adaptability across different datasets and models without additional training.
  • Introduces the Keyword-Prioritized Correction Score (KPCS) as a more comprehensive evaluation metric.
  • Supports safer deployment of LLMs in real-world healthcare applications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar multi-agent setups could be tested in other high-stakes domains like legal document review.
  • The arbitration mechanism might be adapted for general multi-agent LLM tasks beyond medicine.
  • Future work could explore combining this with fine-tuning for even better performance.

Load-bearing premise

Existing automated checks and heuristic-based approaches do not generalize well across unseen datasets while the proposed multi-agent in-context learning design will.

What would settle it

Running the framework on one of the four multilingual clinical note datasets and finding no significant improvement in metrics compared to baseline methods would challenge the claim of effectiveness.

Figures

Figures reproduced from arXiv: 2606.25651 by Congbo Ma, Farah E. Shamout, Hu Wang, Yichun Zhang.

Figure 1
Figure 1. Figure 1: High-level Overview of MedGuards. when factual accuracy and domain-specific terminology are critical [27, 25, 13]. Domain-adapted metrics address this gap. CUI F-score measures overlap in UMLS concepts [14], SapBERTScore leverages biomedical embeddings to capture terminology similarity [23], and Clinical-BLEURT adapts BLEURT for clinical summarization and dialogue [8]. Overall, these metrics were developed… view at source ↗
Figure 2
Figure 2. Figure 2: Distribution of self-consistency rate on the MEDEC [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: Distribution of self-consistency rate on the English dataset of MedErrBench test set. [PITH_FULL_IMAGE:figures/full_fig_p015_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Distribution of self-consistency rate on the Arabic dataset of MedErrbench test set. [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Distribution of self-consistency rate on the Chinese dataset of MedErrbench test set [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗
read the original abstract

As Large Language Models (LLMs) are increasingly deployed in healthcare settings, accurate error detection and correction in generated or existing text becomes critical, as even minor mistakes can pose risks to patient safety. Existing methods for error detection and correction, including automated checks and heuristic-based approaches, do not generalize well across unseen datasets. In this paper, we propose MedGuards as a medical safety guardrail, which is a new framework that treats medical error detection and correction as a multi-agent in-context learning task. Specialized agents separately detect, localize, and correct errors, while a confidence-guided arbitration mechanism resolves disagreements using reasoning traces and confidence scores. This design enhances interpretability, robustness, and adaptability, without requiring additional training of the base LLMs. Additionally, we introduce the Keyword-Prioritized Correction Score (KPCS), a new evaluation metric that considers whether critical keywords within the reference text are generated correctly, providing a more comprehensive assessment than conventional metrics. Experiments across four multilingual medical datasets consisting of clinical notes demonstrate significant improvements by the proposed framework across several metrics and models. Our aim is to enable safer deployment of LLMs in real-world healthcare applications. For reproducibility, we make our code publicly available at https://github.com/congboma/MedGuards.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes MedGuards, a multi-agent in-context learning framework for medical error detection and correction in LLM outputs. It uses specialized agents for detection, localization, and correction, plus a confidence-guided arbitration mechanism based on reasoning traces and scores. The work introduces the Keyword-Prioritized Correction Score (KPCS) metric and claims significant improvements over baselines across four multilingual clinical-note datasets without requiring additional LLM training.

Significance. If the empirical claims hold with proper cross-dataset controls and statistical support, the multi-agent ICL design and KPCS metric could meaningfully improve interpretability and robustness for LLM safety guardrails in healthcare. The training-free adaptability is a potential strength for real-world deployment.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experiments): The central motivation states that existing automated checks and heuristics 'do not generalize well across unseen datasets,' yet the text supplies no protocol details on cross-dataset hold-out (e.g., whether in-context examples are drawn exclusively from three of the four datasets with final metrics reported on the held-out fourth). This separation is load-bearing for attributing gains to the claimed robustness rather than dataset-specific prompting.
  2. [Abstract] Abstract: The claim of 'significant improvements ... across several metrics and models' is asserted without any quantitative results, baseline comparisons, error bars, dataset statistics, or statistical tests. The full manuscript must supply these to substantiate the performance claims.
minor comments (1)
  1. [Abstract] The public code release at the cited GitHub link supports reproducibility and should be retained.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): The central motivation states that existing automated checks and heuristics 'do not generalize well across unseen datasets,' yet the text supplies no protocol details on cross-dataset hold-out (e.g., whether in-context examples are drawn exclusively from three of the four datasets with final metrics reported on the held-out fourth). This separation is load-bearing for attributing gains to the claimed robustness rather than dataset-specific prompting.

    Authors: We agree that the manuscript does not supply explicit protocol details on cross-dataset hold-out. To support the generalization claims, we will revise §4 to describe the evaluation protocol in full, including how in-context examples were selected across the four datasets and whether a leave-one-dataset-out design was employed. revision: yes

  2. Referee: [Abstract] Abstract: The claim of 'significant improvements ... across several metrics and models' is asserted without any quantitative results, baseline comparisons, error bars, dataset statistics, or statistical tests. The full manuscript must supply these to substantiate the performance claims.

    Authors: The detailed quantitative results with baselines, error bars, dataset statistics, and statistical tests appear in §4. We will revise the abstract to include representative quantitative highlights and baseline comparisons to better substantiate the performance claims. revision: yes

Circularity Check

0 steps flagged

No circularity: framework proposal contains no derivations or self-referential reductions

full rationale

The paper proposes MedGuards as a multi-agent ICL system with specialized agents and a new KPCS metric, motivated by a claim that prior heuristic methods fail to generalize. No equations, parameter fittings, or derivation chains appear in the abstract or described structure. Claims rest on experimental results across four datasets rather than any step that reduces by construction to its own inputs, self-citations, or fitted values renamed as predictions. The generalization motivation is an empirical assertion open to external verification and does not constitute a circular step under the defined patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no free parameters, axioms, or invented entities are identifiable from the provided text.

pith-pipeline@v0.9.1-grok · 5756 in / 1268 out tokens · 30801 ms · 2026-06-29T05:02:15.538806+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

56 extracted references · 11 canonical work pages · 3 internal anchors

  1. [1]

    Overview of the mediqa-corr 2024 shared task on medical error detection and correction

    Asma Ben Abacha, Wen-wai Yim, Yujuan Fu, Zhaoyi Sun, Fei Xia, and Meliha Yetisgen-Yildiz. Overview of the mediqa-corr 2024 shared task on medical error detection and correction. In Proceedings of the 6th Clinical Natural Language Processing Workshop, pages 596–603, 2024

  2. [2]

    MEDEC: A benchmark for medical error detection and correction in clinical notes

    Asma Ben Abacha, Wen-wai Yim, Yujuan Fu, Zhaoyi Sun, Meliha Yetisgen, Fei Xia, and Thomas Lin. MEDEC: A benchmark for medical error detection and correction in clinical notes. InFindings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria, July 27 - August 1, 2025, pages 22539–22550, 2025

  3. [3]

    The challenge of uncertainty quantification of large language models in medicine.arXiv preprint arXiv:2504.05278, 2025

    Zahra Atf, Seyed Amir Ahmad Safavi-Naini, Peter R Lewis, Aref Mahjoubfar, Nariman Naderi, Thomas R Savage, and Ali Soroush. The challenge of uncertainty quantification of large language models in medicine.arXiv preprint arXiv:2504.05278, 2025

  4. [4]

    AKNL Aththanagoda, KASH Kulathilake, and NA Abdullah. Precision and personalization: How large language models redefining diagnostic accuracy in personalized medicine—a sys- tematic literature review.IEEE Journal of Biomedical and Health Informatics, 2025

  5. [5]

    Rareagents: Advancing rare disease care through llm-empowered multi-disciplinary team.arXiv preprint arXiv:2412.12475, 2024

    Xuanzhong Chen, Ye Jin, Xiaohao Mao, Lun Wang, Shuyang Zhang, and Ting Chen. Rareagents: Advancing rare disease care through llm-empowered multi-disciplinary team.arXiv preprint arXiv:2412.12475, 2024

  6. [6]

    Mederrbench: A fine-grained multilingual benchmark for medical error detection and correction with clinical expert annotations

    Ma Congbo, Zhang Yichun, Al-Jazzazi Yousef, Foisal Ahamed, Sharma Laasya, Sadqi Yousra, Saleh Khaled, Mallat Jihad, and Shamout Farah E. Mederrbench: A fine-grained multilingual benchmark for medical error detection and correction with clinical expert annotations. In Findings of the Association for Computational Linguistics, ACL 2026, San Diego, Californi...

  7. [7]

    Iryonlp at MEDIQA-CORR 2024: Tackling the medical error detection & correction task on the shoulders of medical agents

    Jean-Philippe Corbeil. Iryonlp at MEDIQA-CORR 2024: Tackling the medical error detection & correction task on the shoulders of medical agents. InProceedings of the 6th Clinical Natural Language Processing Workshop, ClinicalNLP@NAACL 2024, Mexico City, Mexico, June 21, 2024, pages 570–580, 2024

  8. [8]

    Development of a human evaluation framework and correlation with automated metrics for natural language generation of medical diagnoses.medRxiv, 2024

    Emma Croxford, Yanjun Gao, Brian Patterson, Daniel To, Samuel Tesch, Dmitriy Dligach, Anoop Mayampurath, Matthew M Churpek, and Majid Afshar. Development of a human evaluation framework and correlation with automated metrics for natural language generation of medical diagnoses.medRxiv, 2024

  9. [9]

    Medarabiq: Benchmarking large language models on arabic medical tasks

    Mouath Abu Daoud, Chaimae Abouzahir, Leen Kharouf, Walid Al-Eisawi, Nizar Habash, and Farah E Shamout. Medarabiq: Benchmarking large language models on arabic medical tasks. arXiv preprint arXiv:2505.03427, 2025

  10. [10]

    From text to treatment: the crucial role of validation for generative large language models in health care.The Lancet Digital Health, 6(7):e441–e443, 2024

    Anne de Hond, Tuur Leeuwenberg, Richard Bartels, Marieke van Buchem, Ilse Kant, Karel GM Moons, and Maarten van Smeden. From text to treatment: the crucial role of validation for generative large language models in health care.The Lancet Digital Health, 6(7):e441–e443, 2024

  11. [11]

    Deepseek-v3

    DeepSeek AI. Deepseek-v3. https://huggingface.co/deepseek-ai/deepseek-vl-7b ,

  12. [13]

    Doubao-1.5-thinking-pro.https://console.volcengine

    Doubao Team, V olcano Engine. Doubao-1.5-thinking-pro.https://console.volcengine. com/ark/region:ark+cn-beijing/model/detail?Id=doubao-1-5-thinking-pro ,

  13. [14]

    Accessed: 2025-08-01

  14. [15]

    Qafacteval: Improved qa-based factual consistency evaluation for summarization.arXiv preprint arXiv:2112.08542, 2021

    Alexander R Fabbri, Chien-Sheng Wu, Wenhao Liu, and Caiming Xiong. Qafacteval: Improved qa-based factual consistency evaluation for summarization.arXiv preprint arXiv:2112.08542, 2021

  15. [16]

    Summarizing patients’ problems from hospital progress notes using pre-trained sequence-to-sequence models

    Yanjun Gao, Timothy Miller, Dongfang Xu, Dmitriy Dligach, Matthew M Churpek, and Ma- jid Afshar. Summarizing patients’ problems from hospital progress notes using pre-trained sequence-to-sequence models. InProceedings of COLING. International Conference on Com- putational Linguistics, volume 2022, page 2979, 2022. 10

  16. [17]

    Gemini 2.0 and 2.5 flash models

    Google Research. Gemini 2.0 and 2.5 flash models. https://blog.google/technology/ ai/google-gemini-flash, 2025. Accessed: 2025-08-01

  17. [18]

    Ascleai: A llm-based clinical note management system for enhancing clinician productivity

    Jiyeon Han, Jimin Park, Jinyoung Huh, Uran Oh, Jaeyoung Do, and Daehee Kim. Ascleai: A llm-based clinical note management system for enhancing clinician productivity. InExtended Abstracts of the CHI Conference on Human Factors in Computing Systems, pages 1–7, 2024

  18. [19]

    Out-of-distribution detection in medical image analysis: A survey.arXiv preprint arXiv:2404.18279, 2024

    Zesheng Hong, Yubiao Yue, Yubin Chen, Lele Cong, Huanjie Lin, Yuanmei Luo, Mini Han Wang, Weidong Wang, Jialong Xu, Xiaoqi Yang, et al. Out-of-distribution detection in medical image analysis: A survey.arXiv preprint arXiv:2404.18279, 2024

  19. [20]

    What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021

    Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021

  20. [21]

    Mdagents: An adaptive collaboration of llms for medical decision-making.Advances in Neural Information Processing Systems, 37:79410–79452, 2024

    Yubin Kim, Chanwoo Park, Hyewon Jeong, Yik S Chan, Xuhai Xu, Daniel McDuff, Hyeon- hoon Lee, Marzyeh Ghassemi, Cynthia Breazeal, and Hae W Park. Mdagents: An adaptive collaboration of llms for medical decision-making.Advances in Neural Information Processing Systems, 37:79410–79452, 2024

  21. [22]

    Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

  22. [23]

    Camel: Communicative agents for" mind" exploration of large language model society.Advances in Neural Information Processing Systems, 36:51991–52008, 2023

    Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for" mind" exploration of large language model society.Advances in Neural Information Processing Systems, 36:51991–52008, 2023

  23. [24]

    ROUGE: A package for automatic evaluation of summaries

    Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. InText Summariza- tion Branches Out, pages 74–81, 2004

  24. [25]

    Self-alignment pretraining for biomedical entity representations.arXiv preprint arXiv:2010.11784, 2020

    Fangyu Liu, Ehsan Shareghi, Zaiqiao Meng, Marco Basaldella, and Nigel Collier. Self-alignment pretraining for biomedical entity representations.arXiv preprint arXiv:2010.11784, 2020

  25. [26]

    A generalist medical language model for disease diagnosis assistance.Nature medicine, 31(3):932–942, 2025

    Xiaohong Liu, Hao Liu, Guoxing Yang, Zeyu Jiang, Shuguang Cui, Zhaoze Zhang, Huan Wang, Liyuan Tao, Yongchang Sun, Zhu Song, et al. A generalist medical language model for disease diagnosis assistance.Nature medicine, 31(3):932–942, 2025

  26. [27]

    On faithfulness and factuality in abstractive summarization.arXiv preprint arXiv:2005.00661, 2020

    Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. On faithfulness and factuality in abstractive summarization.arXiv preprint arXiv:2005.00661, 2020

  27. [28]

    A comprehensive overview of large language models.ACM Transactions on Intelligent Systems and Technology, 16(5):1–72, 2025

    Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, and Ajmal Mian. A comprehensive overview of large language models.ACM Transactions on Intelligent Systems and Technology, 16(5):1–72, 2025

  28. [29]

    Why We Need New Evaluation Metrics for NLG

    Jekaterina Novikova, Ondˇrej Dušek, Amanda Cercas Curry, and Verena Rieser. Why we need new evaluation metrics for nlg.arXiv preprint arXiv:1707.06875, 2017

  29. [30]

    Gpt-4o-mini model card

    OpenAI. Gpt-4o-mini model card. https://platform.openai.com/docs/models# gpt-4o, 2024. Accessed: 2025-08-01

  30. [31]

    Bleu: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002

  31. [32]

    Python Software Foundation.Python Standard Library: difflib — Helpers for computing deltas,

  32. [33]

    Accessed: 2025-09-23

  33. [34]

    Em_mixers at mediqa-corr 2024: Knowledge-enhanced few-shot in-context learning for medical error detection and correction

    Swati Rajwal, Eugene Agichtein, and Abeed Sarker. Em_mixers at mediqa-corr 2024: Knowledge-enhanced few-shot in-context learning for medical error detection and correction. InProceedings of the 6th Clinical Natural Language Processing Workshop, 2024. 11

  34. [35]

    Medifact at MEDIQA-CORR 2024: Why AI needs a human touch

    Nadia Saeed. Medifact at MEDIQA-CORR 2024: Why AI needs a human touch. InProceedings of the 6th Clinical Natural Language Processing Workshop, ClinicalNLP@NAACL 2024, Mexico City, Mexico, June 21, 2024, pages 346–352, 2024

  35. [36]

    AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

    Samuel Schmidgall, Rojin Ziaei, Carl Harris, Eduardo Reis, Jeffrey Jopling, and Michael Moor. Agentclinic: a multimodal agent benchmark to evaluate ai in simulated clinical environments. arXiv preprint arXiv:2405.07960, 2024

  36. [37]

    Thibault Sellam, Dipanjan Das, and Ankur P. Parikh. BLEURT: learning robust metrics for text generation. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, pages 7881–7892, 2020

  37. [38]

    Trustworthy and practical ai for healthcare: A guided deferral system with large language models

    Joshua Strong, Qianhui Men, and J Alison Noble. Trustworthy and practical ai for healthcare: A guided deferral system with large language models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 28413–28421, 2025

  38. [39]

    Large language models in medicine.Nature medicine, 29(8):1930–1940, 2023

    Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. Large language models in medicine.Nature medicine, 29(8):1930–1940, 2023

  39. [40]

    Hse nlp team at mediqa-corr 2024 task: In-prompt ensemble with entities and knowledge graph for medical error correction

    Airat Valiev and Elena Tutubalina. Hse nlp team at mediqa-corr 2024 task: In-prompt ensemble with entities and knowledge graph for medical error correction. InProceedings of the 6th Clinical Natural Language Processing Workshop, 2024

  40. [41]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models.arXiv preprint arXiv:2203.11171, 2022

  41. [42]

    Le, Ed H

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V . Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. InThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023, 2023

  42. [43]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

  43. [44]

    Physician-and large language model–generated hospital discharge summaries

    Christopher YK Williams, Charumathi Raghu Subramanian, Syed Salman Ali, Michael Apoli- nario, Elisabeth Askin, Peter Barish, Monica Cheng, W James Deardorff, Nisha Donthi, Smitha Ganeshan, et al. Physician-and large language model–generated hospital discharge summaries. JAMA Internal Medicine, 2025

  44. [45]

    Autogen: Enabling next-gen llm applications via multi-agent conversations

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. InFirst Conference on Language Modeling, 2024

  45. [46]

    Knowlab_aimed at mediqa-corr 2024: Chain-of-though (cot) prompting strategies for medical error detection and correction

    Zhaolong Wu, Abul Hasan, Jinge Wu, Yunsoo Kim, Jason Cheung, Teng Zhang, and Honghan Wu. Knowlab_aimed at mediqa-corr 2024: Chain-of-though (cot) prompting strategies for medical error detection and correction. Inproceedings of the 6th clinical natural language processing workshop, pages 353–359, 2024

  46. [47]

    Unilog: Automatic logging via llm and in-context learning

    Junjielong Xu, Ziang Cui, Yuan Zhao, Xu Zhang, Shilin He, Pinjia He, Liqun Li, Yu Kang, Qingwei Lin, Yingnong Dang, et al. Unilog: Automatic logging via llm and in-context learning. InProceedings of the 46th ieee/acm international conference on software engineering, pages 1–12, 2024

  47. [48]

    Bufang Yang, Siyang Jiang, Lilin Xu, Kaiwei Liu, Hai Li, Guoliang Xing, Hongkai Chen, Xiaofan Jiang, and Zhenyu Yan. Drhouse: An llm-empowered diagnostic reasoning system through harnessing outcomes from sensor data and expert knowledge.Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 8(4):1–29, 2024

  48. [49]

    A large language model for electronic health records.NPJ digital medicine, 5(1):194, 2022

    Xi Yang, Aokun Chen, Nima PourNejatian, Hoo Chang Shin, Kaleb E Smith, Christopher Parisien, Colin Compas, Cheryl Martin, Anthony B Costa, Mona G Flores, et al. A large language model for electronic health records.NPJ digital medicine, 5(1):194, 2022. 12

  49. [50]

    Chatdiet: Empowering personalized nutrition-oriented food recommender chatbots through an llm-augmented framework.Smart Health, 32:100465, 2024

    Zhongqi Yang, Elahe Khatibi, Nitish Nagesh, Mahyar Abbasian, Iman Azimi, Ramesh Jain, and Amir M Rahmani. Chatdiet: Empowering personalized nutrition-oriented food recommender chatbots through an llm-augmented framework.Smart Health, 32:100465, 2024

  50. [51]

    Tree of thoughts: Deliberate problem solving with large language models.Ad- vances in neural information processing systems, 36:11809–11822, 2023

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models.Ad- vances in neural information processing systems, 36:11809–11822, 2023

  51. [52]

    A multimodal multi-agent framework for radiology report generation.arXiv preprint arXiv:2505.09787, 2025

    Ziruo Yi, Ting Xiao, and Mark V Albert. A multimodal multi-agent framework for radiology report generation.arXiv preprint arXiv:2505.09787, 2025

  52. [53]

    Clinicalagent: Clinical trial multi-agent system with large language model-based reasoning

    Ling Yue, Sixue Xing, Jintai Chen, and Tianfan Fu. Clinicalagent: Clinical trial multi-agent system with large language model-based reasoning. InProceedings of the 15th ACM Interna- tional Conference on Bioinformatics, Computational Biology and Health Informatics, pages 1–10, 2024

  53. [54]

    Weinberger, and Yoav Artzi

    Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. InInternational Conference on Learning Representations (ICLR), 2020

  54. [55]

    Kg4diagnosis: A hierarchical multi-agent llm framework with knowledge graph enhancement for medical diagnosis

    Kaiwen Zuo, Yirui Jiang, Fan Mo, and Pietro Lio. Kg4diagnosis: A hierarchical multi-agent llm framework with knowledge graph enhancement for medical diagnosis. InAAAI Bridge Program on AI for Medicine and Healthcare, pages 195–204. PMLR, 2025. A Appendix A.1 Datasets, Baseline Models and Evaluation Metrics We conducted experiments on the publicly availabl...

  55. [56]

    sepsis” is not treated as a keyword because it represents an existing diagnosis at admission (“Mr. <NAME/> was treated for sepsis, present on admission

    Tables 7 and 8 use Gpt-4o-mini as the base model forMedGuards. In Table 7, most other model mixtures lead to slightly lower performance. However, we also observe an interesting case: when Gemini 2.0 Flash is introduced as an additional agent in the localization stage (serving as an agreement agent), the localization and correction performance increase. Th...

  56. [57]

    CORRECT". If the text has a medical error return one word

    Other treatments include: Patient on pres- sure reducing surface,zinc citratemoisture barrier, Assist patient to obtain adequate nu- trition. Ms. <NAME/> is being managed for Stage 3 R and L coccyx pressure ulcers, present on admit (POA). Patient is a <AGE/> YO F recently discharged after a 1 month hos- pitalization for recurrent cirrhosis of trans- plant...