BLUEmed: Retrieval-Augmented Multi-Agent Debate for Clinical Error Detection
Recognition: 1 Lean theorem link
Pith reviewed 2026-05-10 16:40 UTC · model grok-4.3
The pith
A multi-agent debate system augmented with hybrid retrieval detects terminology substitution errors in clinical notes more accurately than single-agent RAG or debate-only approaches.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BLUEmed decomposes each clinical note into focused sub-queries, retrieves source-partitioned evidence through dense, sparse, and online retrieval, and assigns two domain expert agents distinct knowledge bases to produce independent analyses; when the experts disagree, a structured counter-argumentation round and cross-source adjudication resolve the conflict, followed by a cascading safety layer that filters common false-positive patterns. This produces the highest accuracy, ROC-AUC, and PR-AUC on a clinical terminology substitution detection benchmark under few-shot prompting.
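The pipeline above is a concrete control flow, so it can be sketched directly. The following is a minimal, illustrative rendering of that flow; the helper objects (`retriever`, `expert_a`, `expert_b`, `adjudicator`) and the false-positive markers are assumptions made for this sketch, not the authors' published implementation.

```python
# Hypothetical sketch of the BLUEmed-style control flow. Every helper
# object below (retriever, experts, adjudicator) is an assumption made
# for illustration, not the paper's actual code.
from dataclasses import dataclass

@dataclass
class Verdict:
    has_error: bool
    rationale: str

# Invented markers standing in for the paper's unspecified
# cascading safety-layer false-positive patterns.
FP_MARKERS = ("abbreviation expansion", "accepted synonym", "unit formatting")

def safety_filter(verdict: Verdict) -> Verdict:
    """Cascading safety layer: demote detections matching known FP patterns."""
    if verdict.has_error and any(m in verdict.rationale.lower() for m in FP_MARKERS):
        return Verdict(False, f"filtered as likely false positive: {verdict.rationale}")
    return verdict

def detect_substitution_error(note, retriever, expert_a, expert_b, adjudicator):
    # 1. Decompose the note into focused sub-queries.
    sub_queries = retriever.decompose(note)
    # 2. Hybrid retrieval (dense, sparse, online), kept partitioned by source.
    evidence = {q: retriever.hybrid_search(q) for q in sub_queries}
    # 3. Two experts with distinct knowledge bases analyze independently.
    v_a = expert_a.analyze(note, evidence)
    v_b = expert_b.analyze(note, evidence)
    # 4. On disagreement: one structured counter-argumentation round,
    #    then cross-source adjudication by a judge agent.
    if v_a.has_error != v_b.has_error:
        verdict = adjudicator.decide(
            note, evidence,
            (v_a, expert_a.rebut(v_b)),
            (v_b, expert_b.rebut(v_a)),
        )
    else:
        verdict = v_a
    # 5. Cascading safety layer filters common false-positive patterns.
    return safety_filter(verdict)
```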
What carries the argument
The BLUEmed framework, which pairs hybrid retrieval-augmented generation with structured multi-agent debate between two domain-expert agents plus a cascading safety layer to resolve conflicts and filter errors.
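The paper does not state the rule by which the dense, sparse, and online rankings are merged. Reciprocal rank fusion is a standard technique for combining heterogeneous retrievers and is one plausible choice; the sketch below assumes it, with illustrative document IDs.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: dict[str, list[str]],
                           k: int = 60) -> list[tuple[str, float]]:
    """Merge per-source rankings (e.g. 'dense', 'sparse', 'online') into one
    list. RRF score of a document: sum over sources of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for source, docs in ranked_lists.items():
        for rank, doc_id in enumerate(docs, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Example: three retrievers ranking the same evidence pool (IDs invented).
fused = reciprocal_rank_fusion({
    "dense":  ["umls:C0020538", "pubmed:123", "wiki:Hypertension"],
    "sparse": ["pubmed:123", "umls:C0020538"],
    "online": ["wiki:Hypertension", "pubmed:123"],
})
print(fused[0])  # highest-consensus document first
```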
If this is right
- Retrieval augmentation and structured debate act as complementary components that together raise detection performance.
- The framework delivers its strongest results when paired with models that already have strong instruction-following and clinical language capabilities.
- Improvements appear consistently across both proprietary and open-source backbone models under both zero-shot and few-shot prompting.
- Few-shot prompting produces higher accuracy, ROC-AUC, and PR-AUC than zero-shot prompting for this task (see the prompt-construction sketch after this list).
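A minimal sketch of how a few-shot prompt for this detection task might be assembled. The exemplar cases and wording below are invented for illustration; the paper's actual prompts (documented in its appendix) are not reproduced here.

```python
# Invented exemplars; the benchmark's real few-shot cases are not shown here.
FEW_SHOT_CASES = [
    ("Patient started on metformin for hypertension.",
     "ERROR: metformin treats diabetes, not hypertension."),
    ("Amoxicillin 500 mg TID prescribed for otitis media.",
     "NO ERROR."),
]

def build_prompt(note: str, few_shot: bool = True) -> str:
    header = ("You are a clinical reviewer. Decide whether the note "
              "contains a terminology substitution error.\n\n")
    shots = ""
    if few_shot:
        for ex_note, ex_label in FEW_SHOT_CASES:
            shots += f"Note: {ex_note}\nAnswer: {ex_label}\n\n"
    return header + shots + f"Note: {note}\nAnswer:"
```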
Where Pith is reading between the lines
- The same decomposition-plus-debate structure could be tested on other categories of clinical documentation errors such as dosage mistakes or missing context.
- Embedding the framework inside electronic health record workflows might allow real-time flagging before notes are finalized.
- Varying the number or specialization of the expert agents could reveal how much additional perspective helps versus adding noise.
Load-bearing premise
The clinical terminology substitution detection benchmark reflects real-world clinical notes and error patterns, and the two domain-expert agents hold enough reliable clinical knowledge to analyze notes without introducing new hallucinations.
What would settle it
Running the full BLUEmed pipeline on a large set of de-identified real hospital clinical notes and comparing its error detections against independent reviews by multiple clinical experts would show whether the reported gains hold outside the benchmark.
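That comparison reduces to measuring agreement between the pipeline's binary detections and adjudicated clinician labels. A minimal sketch, assuming scikit-learn is available; the vote format, majority rule, and toy data are illustrative assumptions, not a protocol from the paper.

```python
from sklearn.metrics import cohen_kappa_score

def agreement_with_experts(model_flags, expert_votes):
    """model_flags: 0/1 detections per note. expert_votes: per-note lists of
    0/1 labels from several independent clinicians (hypothetical format)."""
    consensus = [int(sum(v) > len(v) / 2) for v in expert_votes]
    kappa = cohen_kappa_score(consensus, model_flags)
    raw = sum(c == m for c, m in zip(consensus, model_flags)) / len(consensus)
    return {"raw_agreement": raw, "cohen_kappa": kappa}

# Toy example: three clinicians label four de-identified notes.
votes = [[1, 1, 0], [0, 0, 0], [1, 1, 1], [0, 1, 0]]
print(agreement_with_experts([1, 0, 1, 1], votes))
```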
Original abstract
Terminology substitution errors in clinical notes, where one medical term is replaced by a linguistically valid but clinically different term, pose a persistent challenge for automated error detection in healthcare. We introduce BLUEmed, a multi-agent debate framework augmented with hybrid Retrieval-Augmented Generation (RAG) that combines evidence-grounded reasoning with multi-perspective verification for clinical error detection. BLUEmed decomposes each clinical note into focused sub-queries, retrieves source-partitioned evidence through dense, sparse, and online retrieval, and assigns two domain expert agents distinct knowledge bases to produce independent analyses; when the experts disagree, a structured counter-argumentation round and cross-source adjudication resolve the conflict, followed by a cascading safety layer that filters common false-positive patterns. We evaluate BLUEmed on a clinical terminology substitution detection benchmark under both zero-shot and few-shot prompting with multiple backbone models spanning proprietary and open-source families. Experimental results show that BLUEmed achieves the best accuracy (69.13%), ROC-AUC (74.45%), and PR-AUC (72.44%) under few-shot prompting, outperforming both single-agent RAG and debate-only baselines. Further analyses across six backbone models and two prompting strategies confirm that retrieval augmentation and structured debate are complementary, and that the framework benefits most from models with sufficient instruction-following and clinical language understanding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces BLUEmed, a retrieval-augmented multi-agent debate framework for detecting terminology substitution errors in clinical notes. It decomposes notes into sub-queries, performs hybrid RAG (dense, sparse, and online retrieval), assigns two domain-expert agents distinct knowledge bases for independent analyses, resolves disagreements through structured counter-argumentation and cross-source adjudication, and applies a cascading safety filter. Under few-shot prompting, BLUEmed reports the highest accuracy (69.13%), ROC-AUC (74.45%), and PR-AUC (72.44%) across six backbone models, outperforming single-agent RAG and debate-only baselines; the authors conclude that retrieval and debate are complementary.
Significance. If the benchmark is representative of real clinical notes and the agent disagreements reflect genuine clinical signal, the framework offers a concrete way to combine evidence grounding with multi-perspective verification, potentially reducing hallucinations in clinical error detection. The cross-model and cross-prompting analysis provides evidence that the gains are not tied to a single LLM family. The manuscript supplies explicit baseline comparisons and reports both ROC-AUC and PR-AUC, which is a strength for an imbalanced detection task.
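To make the metric choice concrete: the three headline numbers can be computed from per-note scores with scikit-learn (assumed here; the paper's evaluation code is not shown). The arrays below are invented toy data; on an imbalanced error/no-error split, PR-AUC is typically the more informative of the two AUCs.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, average_precision_score

# Toy stand-in data: y_true = 1 where the note contains a substitution error.
y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.2, 0.6, 0.4, 0.3, 0.7, 0.8, 0.1])  # model scores
y_pred  = (y_score >= 0.5).astype(int)                          # thresholded

print("accuracy:", accuracy_score(y_true, y_pred))
print("roc_auc :", roc_auc_score(y_true, y_score))
# average_precision_score is a standard step-wise PR-AUC estimator.
print("pr_auc  :", average_precision_score(y_true, y_score))
```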
major comments (2)
- [Abstract and Evaluation section] The headline performance numbers (69.13% accuracy, 74.45% ROC-AUC, 72.44% PR-AUC) are presented without any description of benchmark construction, substitution-generation procedure, clinical-note provenance, note-length statistics, or error-distribution validation. Because the central claim is that BLUEmed outperforms baselines on this benchmark, the absence of these details makes it impossible to determine whether the reported gains are robust or artifacts of the test distribution (a hypothetical generator sketch follows this list).
- [Methods section] The two domain-expert agents are said to possess 'distinct knowledge bases,' yet no specification is given of how those bases differ from each other or from the base LLM's pretraining data, nor is there any validation (human or automated) that disagreements are resolved by clinical content rather than prompt artifacts. This directly affects the load-bearing claim that the multi-agent debate component contributes beyond single-agent RAG.
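To make the benchmark-construction concern concrete, here is a hypothetical substitution generator that swaps a term for a linguistically valid but clinically different neighbor. Nothing below reflects the paper's actual procedure; the term pairs are invented, and a real benchmark would draw them from a clinical terminology resource with expert review.

```python
import random

# Hypothetical confusable-term pairs; a real benchmark would source these
# from a terminology resource such as UMLS, validated by clinicians.
CONFUSABLE = {
    "hypothyroidism": "hyperthyroidism",
    "hypoglycemia": "hyperglycemia",
    "metformin": "metoprolol",
    "intravenous": "intramuscular",
}

def inject_substitution(note: str, rng: random.Random):
    """Return (corrupted_note, swapped_pair), or (note, None) if no
    candidate term occurs. Case handling omitted for brevity."""
    candidates = [t for t in CONFUSABLE if t in note]
    if not candidates:
        return note, None
    term = rng.choice(candidates)
    return note.replace(term, CONFUSABLE[term], 1), (term, CONFUSABLE[term])

rng = random.Random(0)
print(inject_substitution("Patient with hypothyroidism started on metformin.", rng))
```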
minor comments (2)
- [Abstract] The abstract states results 'across six backbone models and two prompting strategies' but does not report per-model variance, statistical significance tests, or confidence intervals; adding these would strengthen the empirical claims without altering the central argument (see the bootstrap sketch after this list).
- Notation for the hybrid retrieval components (dense, sparse, online) and the safety-layer false-positive patterns is introduced without a compact table or diagram; a small schematic would improve readability.
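One inexpensive way to supply the missing uncertainty estimates is a percentile bootstrap over notes. A minimal sketch, assuming NumPy; the statistic shown is accuracy, and the same resampling loop applies to ROC-AUC or PR-AUC by swapping the statistic (paired resampling across models would additionally support significance tests).

```python
import numpy as np

def bootstrap_ci(y_true: np.ndarray, y_pred: np.ndarray,
                 n_boot: int = 10_000, alpha: float = 0.05, seed: int = 0):
    """Percentile bootstrap confidence interval for accuracy."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)  # resample notes with replacement
        stats[b] = (y_true[idx] == y_pred[idx]).mean()
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return stats.mean(), (lo, hi)
```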
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each of the major comments below and indicate the revisions we will make to improve the clarity and completeness of the paper.
Point-by-point responses
Referee: [Abstract and Evaluation section] The headline performance numbers (69.13% accuracy, 74.45% ROC-AUC, 72.44% PR-AUC) are presented without any description of benchmark construction, substitution-generation procedure, clinical-note provenance, note-length statistics, or error-distribution validation. Because the central claim is that BLUEmed outperforms baselines on this benchmark, the absence of these details makes it impossible to determine whether the reported gains are robust or artifacts of the test distribution.
Authors: We agree that additional details on the benchmark are necessary to allow proper evaluation of our results. In the revised manuscript, we will expand the Evaluation section (and update the abstract if space permits) to describe the benchmark construction, the substitution-generation procedure, the provenance of the clinical notes, note-length statistics, and validation of the error distribution. These additions will provide context for assessing whether the performance improvements are robust. Revision: yes.
Referee: [Methods section] The two domain-expert agents are said to possess 'distinct knowledge bases,' yet no specification is given of how those bases differ from each other or from the base LLM's pretraining data, nor is there any validation (human or automated) that disagreements are resolved by clinical content rather than prompt artifacts. This directly affects the load-bearing claim that the multi-agent debate component contributes beyond single-agent RAG.
Authors: We acknowledge the need for greater specification of the agents' knowledge bases. In the revised Methods section, we will detail how the distinct knowledge bases are constructed and how they differ from each other and the base model's pretraining data. We will also include an analysis validating that disagreements are driven by clinical content, for example through a case study or automated checks on a subset of examples. This will better support the contribution of the multi-agent debate. Revision: yes.
Circularity Check
No circularity; empirical evaluation on external benchmark
Full rationale
The paper describes a multi-agent RAG+debate architecture for clinical error detection and reports accuracy/ROC/PR-AUC numbers on a terminology-substitution benchmark, with explicit comparisons to single-agent RAG and debate-only baselines. No equations, fitted parameters, or derivations appear in the provided text. No self-citations are invoked to justify uniqueness or ansatz choices. The performance claims are direct empirical measurements against held-out data rather than quantities defined in terms of the model's own outputs or prior self-referential results. This is a standard system paper whose central claims rest on external benchmark comparison and therefore receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
free parameters (2)
- number of expert agents
- prompting regime
axioms (2)
- Domain assumption: retrieved evidence from dense, sparse, and online sources is clinically accurate and relevant.
- Domain assumption: the clinical terminology substitution detection benchmark contains representative real-world errors.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean (and Cost/FunctionalEquation.lean): reality_from_one_distinction; washburn_uniqueness_aczel. Tag: unclear.
  The relation between the paper passage and the cited Recognition theorem is unclear. Linked passage: "BLUEmed decomposes each clinical note into focused sub-queries, retrieves source-partitioned evidence through dense, sparse, and online retrieval, and assigns two domain expert agents distinct knowledge bases to produce independent analyses; when the experts disagree, a structured counter-argumentation round and cross-source adjudication resolve the conflict, followed by a cascading safety layer"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.