pith. machine review for the scientific record.

arxiv: 2604.10389 · v1 · submitted 2026-04-12 · 💻 cs.CL

Recognition: 1 theorem link · Lean Theorem

BLUEmed: Retrieval-Augmented Multi-Agent Debate for Clinical Error Detection

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:40 UTC · model grok-4.3

classification 💻 cs.CL
keywords: clinical error detection · terminology substitution · multi-agent debate · retrieval-augmented generation · healthcare NLP · medical notes · error detection benchmark · multi-agent systems

The pith

A multi-agent debate system augmented with hybrid retrieval detects terminology substitution errors in clinical notes more accurately than single-agent RAG or debate-only approaches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BLUEmed to address terminology substitution errors, where a medical term in a clinical note is replaced by a linguistically valid but clinically incorrect one. It breaks each note into sub-queries, gathers evidence via dense, sparse, and online retrieval, and pits two domain-expert agents with separate knowledge bases against each other. When they disagree, a structured debate and cross-source check resolve the issue, followed by a safety filter for false positives. A reader would care because undetected substitutions can lead to flawed patient care, and the results show consistent gains under few-shot prompting across multiple models.
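
To make the moving parts concrete, here is a minimal, runnable sketch of that flow. Every component is a toy stand-in: the sentence-level decomposition, the stub retrievers and experts, the agreement check, and the safety patterns are illustrative assumptions, not the authors' implementation.

    from dataclasses import dataclass

    @dataclass
    class Verdict:
        is_error: bool
        term: str
        rationale: str

    def decompose(note: str) -> list[str]:
        # Toy decomposition: one focused sub-query per sentence.
        return [s.strip() for s in note.split(".") if s.strip()]

    def retrieve(query: str) -> dict[str, list[str]]:
        # Source-partitioned evidence; the real system queries dense,
        # sparse, and online indexes. Stubbed out here.
        return {"dense": [], "sparse": [], "online": []}

    def expert(query: str, evidence: dict, kb: str) -> Verdict:
        # Stand-in for an LLM expert grounded in its own knowledge base
        # `kb`: a fluid bolus treats hypotension, so "hypertension" next
        # to it is a plausible terminology substitution.
        suspicious = "fluid bolus" in query and "hypertension" in query
        return Verdict(suspicious, "hypertension", f"{kb} vs. retrieved evidence")

    def adjudicate(a: Verdict, b: Verdict, evidence: dict) -> Verdict:
        # Placeholder for the structured counter-argument round and
        # cross-source adjudication used when the experts disagree.
        return a if a.is_error else b

    def safety_filter(v: Verdict) -> bool:
        # Cascading safety layer: drop findings that match known
        # false-positive patterns (the cue list is invented here).
        false_positive_cues = ("abbreviation", "negated")
        return not any(cue in v.rationale for cue in false_positive_cues)

    def detect(note: str) -> list[Verdict]:
        flags = []
        for query in decompose(note):
            ev = retrieve(query)
            a = expert(query, ev, kb="kb_a")
            b = expert(query, ev, kb="kb_b")
            verdict = a if a.is_error == b.is_error else adjudicate(a, b, ev)
            if verdict.is_error and safety_filter(verdict):
                flags.append(verdict)
        return flags

    print(detect("Gave a fluid bolus for hypertension. Continued metformin."))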

Core claim

BLUEmed decomposes each clinical note into focused sub-queries, retrieves source-partitioned evidence through dense, sparse, and online retrieval, and assigns two domain expert agents distinct knowledge bases to produce independent analyses; when the experts disagree, a structured counter-argumentation round and cross-source adjudication resolve the conflict, followed by a cascading safety layer that filters common false-positive patterns. This produces the highest accuracy, ROC-AUC, and PR-AUC on a clinical terminology substitution detection benchmark under few-shot prompting.

What carries the argument

The BLUEmed framework itself: hybrid retrieval-augmented generation paired with structured debate between two domain-expert agents to resolve conflicts, plus a cascading safety layer that filters false positives.
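
The hybrid retrieval step merges rankings from heterogeneous sources. The text shown here does not say how the dense, sparse, and online lists are fused, but the paper's reference list includes reciprocal rank fusion (Cormack et al., reference [22]), so an RRF-style merge is a plausible reading. The sketch below is that standard recipe, with invented document IDs.

    from collections import defaultdict

    def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
        # Reciprocal rank fusion: each list contributes 1/(k + rank) per
        # document; k = 60 is the constant from Cormack et al. (2009).
        scores: dict[str, float] = defaultdict(float)
        for ranking in rankings:
            for rank, doc_id in enumerate(ranking, start=1):
                scores[doc_id] += 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    # Invented IDs standing in for dense, sparse, and online results.
    dense  = ["umls:hypotension", "pubmed:12345", "guideline:sepsis"]
    sparse = ["guideline:sepsis", "umls:hypotension", "textbook:shock"]
    online = ["pubmed:12345", "textbook:shock"]
    print(rrf([dense, sparse, online]))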

If this is right

  • Retrieval augmentation and structured debate act as complementary components that together raise detection performance.
  • The framework delivers its strongest results when paired with models that already have strong instruction-following and clinical language capabilities.
  • Improvements appear consistently across both proprietary and open-source backbone models under both zero-shot and few-shot prompting.
  • Few-shot prompting produces higher accuracy, ROC-AUC, and PR-AUC than zero-shot prompting for this task (an illustrative prompt sketch follows this list).
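
For context on the prompting regimes named above, here is one way the zero-shot and few-shot framings of the detection task might differ. The wording and the exemplar are invented for illustration; the paper's actual system prompts live in its appendix and are not reproduced here.

    ZERO_SHOT = (
        "You are a clinical expert. Does the following note contain a "
        "terminology substitution error? Answer ERROR or CORRECT.\n\n"
        "Note: {note}"
    )

    # Few-shot adds worked exemplars before the target note.
    FEW_SHOT = (
        "You are a clinical expert. Does the note contain a terminology "
        "substitution error? Answer ERROR or CORRECT.\n\n"
        "Example note: Started a fluid bolus for hypertension after blood loss.\n"
        "Answer: ERROR (hypotension was likely substituted with hypertension)\n\n"
        "Note: {note}"
    )

    print(FEW_SHOT.format(note="Continued metformin for type 1 diabetes."))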

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decomposition-plus-debate structure could be tested on other categories of clinical documentation errors such as dosage mistakes or missing context.
  • Embedding the framework inside electronic health record workflows might allow real-time flagging before notes are finalized.
  • Varying the number or specialization of the expert agents could reveal how much additional perspective helps versus adding noise.

Load-bearing premise

The clinical terminology substitution detection benchmark reflects real-world clinical notes and error patterns, and the two domain-expert agents hold enough reliable clinical knowledge to analyze notes without introducing new hallucinations.

What would settle it

Running the full BLUEmed pipeline on a large set of de-identified real hospital clinical notes and comparing its error detections against independent reviews by multiple clinical experts would show whether the reported gains hold outside the benchmark.
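
A minimal sketch of that settling experiment's scoring step, assuming gold labels come from a simple majority vote over the independent expert reviews. The toy reviews and system flags are invented; a real study would also report agreement statistics and confidence intervals.

    def consensus(reviews: list[int]) -> int:
        # 1 = note contains a substitution error; majority vote of reviewers.
        return int(sum(reviews) > len(reviews) / 2)

    def precision_recall(pred: list[int], gold: list[int]) -> tuple[float, float]:
        tp = sum(p and g for p, g in zip(pred, gold))
        fp = sum(p and not g for p, g in zip(pred, gold))
        fn = sum(g and not p for p, g in zip(pred, gold))
        return tp / (tp + fp or 1), tp / (tp + fn or 1)

    expert_reviews = [[1, 1, 0], [0, 0, 0], [1, 1, 1], [0, 1, 0]]  # 3 reviewers
    gold = [consensus(r) for r in expert_reviews]                  # -> [1, 0, 1, 0]
    system_flags = [1, 0, 1, 1]                                    # pipeline output
    print(precision_recall(system_flags, gold))                    # (0.67, 1.0)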

Figures

Figures reproduced from arXiv: 2604.10389 by Hanshu Rao, Nguyen Anh Khoa Tran, Qiunan Zhang, Saukun Thika You, Wesley K. Marizane, Xiaolei Huang.

Figure 1. The BLUEmed framework. The pipeline consists of a hybrid RAG (combining dense, sparse, and online search) with a multi-agent debate structure in which experts present their respective arguments and a judge model validates the final output, with an integrated hybrid safety layer to ensure medical accuracy in clinical notes.
original abstract

Terminology substitution errors in clinical notes, where one medical term is replaced by a linguistically valid but clinically different term, pose a persistent challenge for automated error detection in healthcare. We introduce BLUEmed, a multi-agent debate framework augmented with hybrid Retrieval-Augmented Generation (RAG) that combines evidence-grounded reasoning with multi-perspective verification for clinical error detection. BLUEmed decomposes each clinical note into focused sub-queries, retrieves source-partitioned evidence through dense, sparse, and online retrieval, and assigns two domain expert agents distinct knowledge bases to produce independent analyses; when the experts disagree, a structured counter-argumentation round and cross-source adjudication resolve the conflict, followed by a cascading safety layer that filters common false-positive patterns. We evaluate BLUEmed on a clinical terminology substitution detection benchmark under both zero-shot and few-shot prompting with multiple backbone models spanning proprietary and open-source families. Experimental results show that BLUEmed achieves the best accuracy (69.13%), ROC-AUC (74.45%), and PR-AUC (72.44%) under few-shot prompting, outperforming both single-agent RAG and debate-only baselines. Further analyses across six backbone models and two prompting strategies confirm that retrieval augmentation and structured debate are complementary, and that the framework benefits most from models with sufficient instruction-following and clinical language understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces BLUEmed, a retrieval-augmented multi-agent debate framework for detecting terminology substitution errors in clinical notes. It decomposes notes into sub-queries, performs hybrid RAG (dense, sparse, and online retrieval), assigns two domain-expert agents distinct knowledge bases for independent analyses, resolves disagreements through structured counter-argumentation and cross-source adjudication, and applies a cascading safety filter. Under few-shot prompting, BLUEmed reports the highest accuracy (69.13%), ROC-AUC (74.45%), and PR-AUC (72.44%) across six backbone models, outperforming single-agent RAG and debate-only baselines; the authors conclude that retrieval and debate are complementary.

Significance. If the benchmark is representative of real clinical notes and the agent disagreements reflect genuine clinical signal, the framework offers a concrete way to combine evidence grounding with multi-perspective verification, potentially reducing hallucinations in clinical error detection. The cross-model and cross-prompting analysis provides evidence that the gains are not tied to a single LLM family. The manuscript supplies explicit baseline comparisons and reports both ROC-AUC and PR-AUC, which is a strength for an imbalanced detection task.
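
On the imbalance point: a small synthetic example of why ROC-AUC and PR-AUC can tell different stories for a rare-error detector. The 10% error rate and the score distributions are assumptions chosen only to illustrate the divergence, not figures from the paper.

    import numpy as np
    from sklearn.metrics import roc_auc_score, average_precision_score

    rng = np.random.default_rng(0)
    n, pos_rate = 2000, 0.10                    # assume 10% of notes contain an error
    y = (rng.random(n) < pos_rate).astype(int)
    # Detector scores: positives shifted up slightly, heavy overlap with negatives.
    scores = rng.normal(loc=0.0, scale=1.0, size=n) + 0.8 * y

    print(f"ROC-AUC: {roc_auc_score(y, scores):.3f}")            # looks comfortable
    print(f"PR-AUC:  {average_precision_score(y, scores):.3f}")  # much lower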

major comments (2)
  1. [Abstract and Evaluation section] The headline performance numbers (69.13% accuracy, 74.45% ROC-AUC, 72.44% PR-AUC) are presented without any description of benchmark construction, substitution-generation procedure, clinical-note provenance, note-length statistics, or error-distribution validation. Because the central claim is that BLUEmed outperforms baselines on this benchmark, the absence of these details makes it impossible to determine whether the reported gains are robust or artifacts of the test distribution.
  2. [Methods section] The two domain-expert agents are said to possess 'distinct knowledge bases,' yet no specification is given of how those bases differ from each other or from the base LLM's pretraining data, nor is there any validation (human or automated) that disagreements are resolved by clinical content rather than prompt artifacts. This directly affects the load-bearing claim that the multi-agent debate component contributes beyond single-agent RAG.
minor comments (2)
  1. [Abstract] The abstract states results 'across six backbone models and two prompting strategies' but does not report per-model variance, statistical significance tests, or confidence intervals; adding these would strengthen the empirical claims without altering the central argument.
  2. Notation for the hybrid retrieval components (dense, sparse, online) and the safety-layer false-positive patterns is introduced without a compact table or diagram; a small schematic would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each of the major comments below and indicate the revisions we will make to improve the clarity and completeness of the paper.

point-by-point responses
  1. Referee: [Abstract and Evaluation section] The headline performance numbers (69.13% accuracy, 74.45% ROC-AUC, 72.44% PR-AUC) are presented without any description of benchmark construction, substitution-generation procedure, clinical-note provenance, note-length statistics, or error-distribution validation. Because the central claim is that BLUEmed outperforms baselines on this benchmark, the absence of these details makes it impossible to determine whether the reported gains are robust or artifacts of the test distribution.

    Authors: We agree that additional details on the benchmark are necessary to allow proper evaluation of our results. In the revised manuscript, we will expand the Evaluation section (and update the abstract if space permits) to describe the benchmark construction, the substitution-generation procedure, the provenance of the clinical notes, note-length statistics, and validation of the error distribution. These additions will provide context for assessing whether the performance improvements are robust. revision: yes

  2. Referee: [Methods section] The two domain-expert agents are said to possess 'distinct knowledge bases,' yet no specification is given of how those bases differ from each other or from the base LLM's pretraining data, nor is there any validation (human or automated) that disagreements are resolved by clinical content rather than prompt artifacts. This directly affects the load-bearing claim that the multi-agent debate component contributes beyond single-agent RAG.

    Authors: We acknowledge the need for greater specification of the agents' knowledge bases. In the revised Methods section, we will detail how the distinct knowledge bases are constructed and how they differ from each other and the base model's pretraining data. We will also include an analysis validating that disagreements are driven by clinical content, for example through a case study or automated checks on a subset of examples. This will better support the contribution of the multi-agent debate. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical evaluation on external benchmark

full rationale

The paper describes a multi-agent RAG+debate architecture for clinical error detection and reports accuracy/ROC/PR-AUC numbers on a terminology-substitution benchmark, with explicit comparisons to single-agent RAG and debate-only baselines. No equations, fitted parameters, or derivations appear in the provided text. No self-citations are invoked to justify uniqueness or ansatz choices. The performance claims are direct empirical measurements against held-out data rather than quantities defined in terms of the model's own outputs or prior self-referential results. This is a standard system paper whose central claims rest on external benchmark comparison and therefore receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim depends on the assumption that retrieved medical evidence is accurate and that the chosen benchmark reflects real clinical error patterns. No new physical entities are postulated. The framework introduces no free parameters beyond standard design choices such as the number of agents and prompting regime.

free parameters (2)
  • number of expert agents
    Framework design choice of exactly two agents with separate knowledge bases.
  • prompting regime
    Performance is reported specifically under few-shot prompting.
axioms (2)
  • domain assumption Retrieved evidence from dense, sparse, and online sources is clinically accurate and relevant.
    The entire RAG component rests on this assumption about retrieval quality.
  • domain assumption The clinical terminology substitution detection benchmark contains representative real-world errors.
    Evaluation validity depends on this representativeness claim.

pith-pipeline@v0.9.0 · 5554 in / 1584 out tokens · 68504 ms · 2026-05-10T16:40:22.999045+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

55 extracted references · 9 canonical work pages · 4 internal anchors

  1. [1] Y. K. Alotaibi and F. Federico, "The impact of health information technology on patient safety," Saudi Medical Journal, vol. 38, no. 12, p. 1173, 2017.

  2. [2] H. Rao, W. Liu, H. Wang, I.-C. Huang, Z. He, and X. Huang, "A scoping review of synthetic data generation by language models in biomedical research and application: Data utility and quality perspectives," Journal of Healthcare Informatics Research, pp. 1–26, 2026.

  3. [3] D. W. Bates, "Preventing medication errors: a summary," American Journal of Health-System Pharmacy, vol. 64, no. 14 Supplement 9, pp. S3–S9, 2007.

  4. [4] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel et al., "Retrieval-augmented generation for knowledge-intensive NLP tasks," in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 9459–9474.

  5. [5] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, H. Wang, and H. Wang, "Retrieval-augmented generation for large language models: A survey," 2023.

  6. [6] K. Shuster, S. Poff, M. Chen, D. Kiela, and J. Weston, "Retrieval augmentation reduces hallucination in conversation," in Findings of the Association for Computational Linguistics: EMNLP 2021, M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih, Eds. Punta Cana, Dominican Republic: Association for Computational Linguistics, Nov. 2021, pp. 3784–3803.

  7. [7] Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mordatch, "Improving factuality and reasoning in language models through multiagent debate," in Forty-first International Conference on Machine Learning, 2023.

  8. [8] T. Liang, Z. He, W. Jiao, X. Wang, Y. Wang, R. Wang, Y. Yang, S. Shi, and Z. Tu, "Encouraging divergent thinking in large language models through multi-agent debate," in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024, pp. 17889–17904.

  9. [9] C.-M. Chan, W. Chen, Y. Su, J. Yu, W. Xue, S. Zhang, J. Fu, and Z. Liu, "ChatEval: Towards better LLM-based evaluators through multi-agent debate," arXiv preprint arXiv:2308.07201, 2023.

  10. [10] K. Xiong, X. Ding, Y. Cao, T. Liu, and B. Qin, "Examining inter-consistency of large language models collaboration: An in-depth analysis via debate," in Findings of the Association for Computational Linguistics: EMNLP 2023, H. Bouamor, J. Pino, and K. Bali, Eds. Singapore: Association for Computational Linguistics, Dec. 2023, pp. 7572–7590.

  11. [11] Ö. Uzuner, "Recognizing obesity and comorbidities in sparse data," Journal of the American Medical Informatics Association, vol. 16, no. 4, pp. 561–570, 2009.

  12. [12] C. Shivade, P. Raghavan, E. Fosler-Lussier, P. J. Embi, N. Elhadad, S. B. Johnson, and A. M. Lai, "A review of approaches to identifying patient phenotype cohorts using electronic health records," Journal of the American Medical Informatics Association, vol. 21, no. 2, pp. 221–230, 2014.

  13. [13] F. Fang, Y. Bai, S. Ni, M. Yang, X. Chen, and R. Xu, "Enhancing noise robustness of retrieval-augmented language models with adaptive adversarial training," in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Bangkok, Thailand: Association for Computational Linguistics, Aug. 2024, pp. 10028–10039.

  14. [14] O. Yoran, T. Wolfson, O. Ram, and J. Berant, "Making retrieval-augmented language models robust to irrelevant context," in The Twelfth International Conference on Learning Representations, 2024.

  15. [15] W. Yu, H. Zhang, X. Pan, P. Cao, K. Ma, J. Li, H. Wang, and D. Yu, "Chain-of-note: Enhancing robustness in retrieval-augmented language models," in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024, pp. 14672–14685.

  16. [16] A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi, "Self-RAG: Learning to retrieve, generate, and critique through self-reflection," in The Twelfth International Conference on Learning Representations, 2024.

  17. [17] C. Niu, Y. Wu, J. Zhu, S. Xu, K. Shum, R. Zhong, J. Song, and T. Zhang, "RAGTruth: A hallucination corpus for developing trustworthy retrieval-augmented language models," in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 10862–10878.

  18. [18] J. Hwang, J. Park, H. Park, D. Kim, S. Park, and J. Ok, "Retrieval-augmented generation with estimation of source reliability," in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 34267–34291.

  19. [19] Z. Zhao, E. Wallace, S. Feng, D. Klein, and S. Singh, "Calibrate before use: Improving few-shot performance of language models," in International Conference on Machine Learning. PMLR, 2021, pp. 12697–12706.

  20. [20] D. Ru, L. Qiu, X. Hu, T. Zhang, P. Shi, S. Chang, C. Jiayang, C. Wang, S. Sun, H. Li et al., "RAGChecker: A fine-grained framework for diagnosing retrieval-augmented generation," in Advances in Neural Information Processing Systems, vol. 37, 2024, pp. 21999–22027.

  21. [21] Z. Kenton, N. Siegel, J. Kramár, J. Brown-Cohen, S. Albanie, J. Bulian, R. Agarwal, D. Lindner, Y. Tang, N. Goodman et al., "On scalable oversight with weak LLMs judging strong LLMs," in Advances in Neural Information Processing Systems, vol. 37, 2024, pp. 75229–75276.

  22. [22] G. V. Cormack, C. L. Clarke, and S. Buettcher, "Reciprocal rank fusion outperforms Condorcet and individual rank learning methods," in Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2009, pp. 758–759.

  23. [23] A. Ben Abacha, W.-w. Yim, Y. Fu, Z. Sun, M. Yetisgen, F. Xia, and T. Lin, "MEDEC: A benchmark for medical error detection and correction in clinical notes," in Findings of the Association for Computational Linguistics: ACL 2025, W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Vienna, Austria: Association for Computational Linguistics, Jul. 2025...

  24. [24] A. Tabaie, A. Tran, T. Calabria, S. S. Bennett, A. Milicia, W. Weintraub, W. J. Gallagher, J. Yosaitis, L. C. Schubel, M. A. Hill et al., "Evaluation of a natural language processing approach to identify diagnostic errors and analysis of safety learning system case review data: retrospective cohort study," Journal of Medical Internet Research, vol. 26, p. e...

  25. [25] S. Modi, K. A. Kasmiran, N. M. Sharef, and M. Y. Sharum, "Extracting adverse drug events from clinical notes: A systematic review of approaches used," Journal of Biomedical Informatics, vol. 151, p. 104603, 2024.

  26. [26] X. Chen and S. Wiseman, "BM25 query augmentation learned end-to-end," 2023.

  27. [27] Z. Rackauckas, "RAG-Fusion: a new take on retrieval-augmented generation," 2024.

  28. [28] A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford et al., "GPT-4o system card," arXiv preprint arXiv:2410.21276, 2024.

  29. [29] OpenAI, "GPT-5.2 model," OpenAI API model documentation, 2025. [Online]. Available: https://developers.openai.com/api/docs/models/gpt-5.2. Accessed: Feb. 7, 2026.

  30. [30] Google DeepMind, "Gemini 2.0 Flash model card," Apr. 2025. [Online]. Available: https://modelcards.withgoogle.com/assets/documents/gemini-2-flash.pdf. Accessed: Feb. 7, 2026.

  31. [31] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., "Qwen3 technical report," arXiv preprint arXiv:2505.09388, 2025.

  32. [32] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan et al., "The Llama 3 herd of models," arXiv preprint arXiv:2407, 2024.

  33. [33] J. Lee, F. Chen, S. Dua, D. Cer, M. Shanbhogue, I. Naim, G. H. Ábrego, Z. Li, K. Chen, H. S. Vera et al., "Gemini Embedding: Generalizable embeddings from Gemini," arXiv preprint arXiv:2503.07891, 2025.

  34. [34] A. B. Abacha, W.-w. Yim, Y. Fu, Z. Sun, F. Xia, and M. Yetisgen-Yildiz, "Overview of the MEDIQA-CORR 2024 shared task on medical error detection and correction," in Proceedings of the 6th Clinical Natural Language Processing Workshop, 2024, pp. 596–603.

  35. [35] S. Pandit, J. Xu, J. Hong, Z. Wang, T. Chen, K. Xu, and Y. Ding, "MedHallu: A comprehensive benchmark for detecting medical hallucinations in large language models," in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng, Eds. Suzhou, China: Association for ...

  36. [36] N. Brake and T. Schaaf, "Comparing two model designs for clinical note generation; is an LLM a useful evaluator of consistency?" in Findings of the Association for Computational Linguistics: NAACL 2024, K. Duh, H. Gomez, and S. Bethard, Eds. Mexico City, Mexico: Association for Computational Linguistics, Jun. 2024, pp. 352–363.

  37. [37] K. Zhou, J. M. Giorgi, P. Mani, P. Xu, D. Liang, and C. Tan, "From feedback to checklists: Grounded evaluation of AI-generated clinical notes," in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, 2025, pp. 1485–1499.

  38. [38] P. Mishra, Z. Yao, P. Vashisht, F. Ouyang, B. Wang, V. D. Mody, and H. Yu, "SYNFAC-EDIT: Synthetic imitation edit feedback for factual alignment in clinical summarization," in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Miami, Florida, USA: Association for Computational Linguistics, Nov. 2024, pp. 20061–20083.

  39. [39] B. Chu, M. Li, S. Frihat, C. Gu, G. Lodde, E. Livingstone, and N. Fuhr, "TracSum: A new benchmark for aspect-based summarization with sentence-level traceability in medical domain," in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 844–864.

  40. [40] G. Xiong, Q. Jin, Z. Lu, and A. Zhang, "Benchmarking retrieval-augmented generation for medicine," in Findings of the Association for Computational Linguistics ACL 2024, 2024, pp. 6233–6251.

  41. [41] Y. Yang, P. Carlson, S. He, Y. Qiao, and T. Yang, "Cluster-based partial dense retrieval fused with sparse text retrieval," in Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2024, pp. 2327–2331.

  42. [42] Y. Liu, J. Li, Y. Wu, and Z. Chen, "POQD: Performance-oriented query decomposer for multi-vector retrieval," arXiv preprint arXiv:2505.19189, 2025.

  43. [43] A. Estornell and Y. Liu, "Multi-LLM debate: Framework, principals, and interventions," in Advances in Neural Information Processing Systems, vol. 37, 2024, pp. 28938–28964.

  44. [44] M. Li, J. Chen, M. Xu, and X. Wang, "Hallucination detection in structured query generation via LLM self-debating," in Findings of the Association for Computational Linguistics: EMNLP 2025, 2025, pp. 16102–16113.

  45. [45] Z. Zhong, K. Zhou, and D. Mottin, "Harnessing large language models as post-hoc correctors," arXiv preprint arXiv:2402.13414, 2024.

  46. [46] A. Smit, P. Duckworth, N. Grinsztajn, T. D. Barrett, and A. Pretorius, "Should we be going MAD? A look at multi-agent debate strategies for LLMs," arXiv preprint arXiv:2311.17371, 2023.

  47. [47] R. Sanayei, S. Vesic, E. Blanco, and M. Surdeanu, "Can LLMs judge debates? Evaluating non-linear reasoning via argumentation theory semantics," arXiv preprint arXiv:2509.15739, 2025.

  48. [48] T. Hu, Z. Tan, S. Wang, H. Qu, and T. Chen, "Multi-agent debate for LLM judges with adaptive stability detection," in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. [Online]. Available: https://openreview.net/forum?id=Vusd1Hw2D9

  49. [49] M. Koupaee, J. W. Vincent, S. Mansour, I. Shalyminov, H. He, H. Song, R. Shu, J. He, Y. Nian, A. W.-m. Wong, K. J. Han, and H. Su, "Faithful, unfaithful or ambiguous? Multi-agent debate with initial stance for summary evaluation," in Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics:...

  50. [50] M. Kang and B. Li, "R²-Guard: Robust reasoning enabled LLM guardrail via knowledge-enhanced logical reasoning," in The Thirteenth International Conference on Learning Representations, 2025.

  51. [51] M. Kang, Z. Chen, and B. Li, "C-SafeGen: Certified safe LLM generation with claim-based streaming guardrails," in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

  52. [52] D. Y.-B. Wang, Z. Shen, S. S. Mishra, Z. Xu, Y. Teng, and H. Ding, "SLOT: Structuring the output of large language models," arXiv preprint arXiv:2505.04016, 2025.

  53. [53] A. Kugic, S. Schulz, and M. Kreuzthaler, "Disambiguation of acronyms in clinical narratives with large language models," Journal of the American Medical Informatics Association, vol. 31, no. 9, pp. 2040–2046, 2024.

  54. [54] M. Lee, K. Kim, T. Kim, and S. Park, "Selective generation for controllable language models," Advances in Neural Information Processing Systems, vol. 37, pp. 50494–50527, 2024.

  55. [55] M. Sharif, G. Han, W. Liu, and X. Huang, "Cultivating multidisciplinary research and education on GPU infrastructure for mid-south institutions at the University of Memphis: Practice and challenge," 2025.

Appendix (System Prompts), extracted fragment: "This appendix documents the system prompts used by BLUEmed for the two domain experts and the adjudicator judge. To avoid redunda...