Recognition: unknown
VeriLLMed: Interactive Visual Debugging of Medical Large Language Models with Knowledge Graphs
Pith reviewed 2026-05-08 08:00 UTC · model grok-4.3
The pith
VeriLLMed converts medical LLM outputs into reasoning paths and compares them against biomedical knowledge graph references to expose three classes of diagnostic errors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VeriLLMed transforms model outputs into comparable reasoning paths, constructs knowledge graph-grounded reference paths, and identifies three recurring classes of diagnosis errors: relation errors, branch errors, and missing errors. Case studies and expert evaluation demonstrate that VeriLLMed helps developers identify clinically implausible reasoning and generate actionable insights that can inform the improvement of medical LLMs.
What carries the argument
The VeriLLMed visual analytics system that converts LLM diagnostic outputs into reasoning paths and aligns them with biomedical knowledge graph reference paths to classify errors.
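The paper's own implementation is not reproduced here; the following is a minimal sketch, assuming reasoning steps on both sides have already been reduced to (source, relation, target) triples, of how the three error classes could be detected. The triple representation, function names, and exact matching rules are illustrative assumptions, not the authors' definitions.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    source: str
    relation: str
    target: str

def classify_errors(model_path: list[Triple], reference_path: list[Triple]) -> dict[str, list[Triple]]:
    """Compare an LLM reasoning path against a KG-grounded reference path.

    relation error: same entity pair, but the model asserts a different relation
    branch error:   a model step whose entity pair is absent from the reference
    missing error:  a reference step the model never covers
    """
    ref_by_pair = {(t.source, t.target): t.relation for t in reference_path}
    model_pairs = {(t.source, t.target) for t in model_path}
    errors: dict[str, list[Triple]] = {"relation": [], "branch": [], "missing": []}

    for t in model_path:
        pair = (t.source, t.target)
        if pair in ref_by_pair:
            if t.relation != ref_by_pair[pair]:
                errors["relation"].append(t)   # wrong relation between known entities
        else:
            errors["branch"].append(t)         # step not supported by the reference
    for t in reference_path:
        if (t.source, t.target) not in model_pairs:
            errors["missing"].append(t)        # expected step the model skipped
    return errors

# Hypothetical toy case
model = [Triple("fever", "caused_by", "influenza"),
         Triple("influenza", "treated_with", "antibiotics")]
reference = [Triple("fever", "symptom_of", "influenza"),
             Triple("influenza", "treated_with", "oseltamivir")]
print(classify_errors(model, reference))
```

Matching on exact entity pairs is the simplest possible alignment; allowing multi-hop or fuzzy alignment would shift which steps count as branch versus missing errors, and the paper's own alignment is presumably richer than this.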
If this is right
- Developers gain a way to move from isolated manual reviews to pattern-based inspection across many cases.
- Three specific error categories become visible targets for model retraining or prompt engineering.
- Insights from the system can directly guide iterative improvements to medical LLMs before they reach clinical use.
- Prioritization of which failures to fix first becomes possible when recurring patterns are quantified.
Where Pith is reading between the lines
- The same path-comparison approach could be adapted to debug LLMs in other high-stakes fields such as legal reasoning or financial forecasting.
- Keeping the reference knowledge graphs current would require ongoing maintenance pipelines as medical knowledge evolves.
- The three error classes might serve as labels for new training objectives that penalize relation, branch, and missing mistakes explicitly.
Load-bearing premise
The selected biomedical knowledge graphs supply accurate, complete, and current reference paths that correctly represent proper diagnostic reasoning in the tested cases.
What would settle it
Have an independent panel of clinicians review the same model outputs: if the errors flagged by VeriLLMed do not correspond to genuinely implausible clinical reasoning, or if the reference paths themselves contain inaccuracies, the central claim fails.
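If such a panel study were run, one simple way to quantify the outcome is agreement between VeriLLMed's flags and the panel's judgments, for example Cohen's kappa over per-step implausibility labels. The data layout below is a hypothetical illustration; only the kappa formula itself is standard.

```python
from collections import Counter

def cohen_kappa(labels_a: list[int], labels_b: list[int]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if the two raters labeled independently.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(labels_a) | set(labels_b)) / n**2
    if expected == 1:
        return 1.0  # degenerate case: both raters always give the same single label
    return (observed - expected) / (1 - expected)

# Hypothetical per-step labels: 1 = "clinically implausible", 0 = "plausible".
verillmed_flags = [1, 0, 1, 1, 0, 0, 1, 0]
clinician_panel = [1, 0, 0, 1, 0, 0, 1, 1]
print(f"kappa = {cohen_kappa(verillmed_flags, clinician_panel):.2f}")  # 0.50
```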
Figures
read the original abstract
Large language models (LLMs) show promise in medical diagnosis, but real-world deployment remains challenging due to high-stakes clinical decisions and imperfect reasoning reliability. As a result, careful inspection of model behavior is essential for assessing whether diagnostic reasoning is reliable and clinically grounded. However, debugging medical LLMs remains difficult. First, developers often lack sufficient medical domain expertise to interpret model errors in clinically meaningful terms. Second, models can fail across a large and diverse set of instances involving different input types, tasks, and reasoning steps, making it challenging for developers to prioritize which errors deserve focused inspection. Third, developers struggle to identify recurring error patterns across cases, as existing debugging practices are largely instance-centric and rely on manual inspection of isolated failures. To address these challenges, we present VeriLLMed, a visual analytics system that integrates external biomedical knowledge to audit and debug medical LLM diagnostic reasoning. VeriLLMed transforms model outputs into comparable reasoning paths, constructs knowledge graph-grounded reference paths, and identifies three recurring classes of diagnosis errors: relation errors, branch errors, and missing errors. Case studies and expert evaluation demonstrate that VeriLLMed helps developers identify clinically implausible reasoning and generate actionable insights that can inform the improvement of medical LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents VeriLLMed, a visual analytics system that integrates biomedical knowledge graphs to audit medical LLM diagnostic reasoning. It transforms model outputs into comparable reasoning paths, constructs KG-grounded reference paths, classifies errors into relation errors, branch errors, and missing errors, and uses interactive visualizations to help developers identify recurring patterns. Case studies and expert evaluation are offered as evidence that the system enables identification of clinically implausible reasoning and generation of actionable insights for LLM improvement.
Significance. If the KG reference paths reliably proxy correct clinical reasoning, VeriLLMed could meaningfully address the three stated challenges (domain expertise gaps, error prioritization across instances, and pattern detection) by providing a structured, visual debugging workflow for medical LLMs. The error taxonomy and visual interface represent a practical contribution to LLM auditing tools, though the absence of quantitative validation metrics limits the strength of the supporting evidence.
major comments (2)
- [System overview and error classification (sections describing reference path construction and error taxonomy)] The central claim that VeriLLMed identifies 'clinically implausible reasoning' rests on the assumption that the chosen biomedical KGs yield reference paths that accurately and completely encode correct diagnostic reasoning. No quantitative comparison of KG-derived paths against independent clinician-annotated paths, no coverage statistics for the evaluated cases, and no sensitivity analysis for KG incompleteness or outdated relations are provided; this makes the error classifications (relation/branch/missing) difficult to interpret as clinically grounded rather than KG-specific artifacts.
- [Case studies and expert evaluation] The evaluation relies on case studies and expert evaluation to support claims of actionable insights and identification of implausible reasoning, yet reports no quantitative metrics (e.g., inter-rater agreement, task completion times, error detection rates), participant numbers, study design details, or controls. This leaves the strength of evidence for the central usability and insight-generation claims unclear.
minor comments (2)
- [Abstract] The abstract and introduction would benefit from a brief statement of the number of cases, experts, and KGs used in the evaluation to allow readers to gauge scope immediately.
- [Figures and visualizations] Ensure all figures include clear legends, axis labels, and direct references in the text; some visual elements may require additional annotation to distinguish model paths from reference paths at a glance.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important aspects of grounding our error classifications and strengthening the evaluation evidence. We address each major comment point-by-point below, indicating planned revisions where they align with the manuscript's scope. We believe these changes will clarify the claims without overstating the current evidence.
read point-by-point responses
- Referee: [System overview and error classification (sections describing reference path construction and error taxonomy)] The central claim that VeriLLMed identifies 'clinically implausible reasoning' rests on the assumption that the chosen biomedical KGs yield reference paths that accurately and completely encode correct diagnostic reasoning. No quantitative comparison of KG-derived paths against independent clinician-annotated paths, no coverage statistics for the evaluated cases, and no sensitivity analysis for KG incompleteness or outdated relations are provided; this makes the error classifications (relation/branch/missing) difficult to interpret as clinically grounded rather than KG-specific artifacts.
Authors: We agree that the fidelity of KG-derived reference paths as proxies for correct clinical reasoning is a foundational assumption, and that the absence of quantitative validation metrics makes it harder to interpret the error classifications as fully clinically grounded. The manuscript (Section 3.2) constructs reference paths from established biomedical KGs and defines the three error classes explicitly as deviations from these paths; the case studies then illustrate how these deviations surface implausible steps. The expert evaluation provides qualitative confirmation that experts viewed the flagged errors as clinically relevant. However, we did not include coverage statistics, sensitivity analysis, or direct comparisons to independent clinician annotations in the original submission. In the revised version, we will add a new limitations subsection in the Discussion that reports coverage statistics for the evaluated cases (derived from the KG; one possible coverage computation is sketched after these responses) and discusses potential impacts of KG incompleteness. We will also explicitly scope the claims to 'KG-grounded' rather than claiming absolute clinical correctness. A full quantitative clinician-annotation study lies beyond the current work's resources and is noted as future work. revision: partial
- Referee: [Case studies and expert evaluation] The evaluation relies on case studies and expert evaluation to support claims of actionable insights and identification of implausible reasoning, yet reports no quantitative metrics (e.g., inter-rater agreement, task completion times, error detection rates), participant numbers, study design details, or controls. This leaves the strength of evidence for the central usability and insight-generation claims unclear.
Authors: We acknowledge that the Evaluation section would benefit from greater transparency on study details to better support the claims of actionable insights. The manuscript presents two detailed case studies showing how the visual interface reveals recurring error patterns, followed by expert evaluation in which domain experts interacted with the system to audit LLM outputs and generate improvement suggestions. To strengthen this, we will revise the section to specify the number of expert participants, the semi-structured interview protocol used, and any observed consistency in the insights generated. While the original study did not collect quantitative metrics such as inter-rater agreement or task completion times (focusing instead on qualitative demonstration of the workflow), we will add an explicit statement of this scope and note that controlled quantitative usability testing remains valuable future work. These additions will make the evidence base clearer without changing the qualitative nature of the reported findings. revision: partial
- A full quantitative validation comparing KG-derived paths against independent clinician-annotated reference paths would require new data collection and annotation effort that is not feasible within the timeline or scope of this revision.
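As context for the coverage statistics promised in the first response, below is a minimal sketch of one way such numbers could be computed, assuming each case has been reduced to a set of normalized entity mentions and the KG exposes its node vocabulary. All names and the toy data are hypothetical, not the paper's pipeline.

```python
def kg_coverage(cases: dict[str, set[str]], kg_entities: set[str]) -> dict[str, float]:
    """Fraction of each case's clinical entity mentions that the reference KG can ground."""
    return {
        case_id: (len(mentions & kg_entities) / len(mentions)) if mentions else 0.0
        for case_id, mentions in cases.items()
    }

# Hypothetical toy data: normalized entity mentions per evaluated case.
cases = {
    "case_01": {"chest pain", "troponin", "myocardial infarction"},
    "case_02": {"polyuria", "hba1c", "type 2 diabetes", "rare_syndrome_x"},
}
kg_entities = {"chest pain", "troponin", "myocardial infarction",
               "polyuria", "hba1c", "type 2 diabetes"}

coverage = kg_coverage(cases, kg_entities)
for case_id, frac in coverage.items():
    print(f"{case_id}: {frac:.0%} of mentions grounded in the KG")
print(f"mean coverage: {sum(coverage.values()) / len(coverage):.0%}")
```

Per-case coverage below some threshold would also give a principled way to separate cases where missing-error counts mostly reflect KG gaps from cases where they reflect model mistakes.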
Circularity Check
No significant circularity; system design validated externally
full rationale
The paper describes a visual analytics system (VeriLLMed) that transforms LLM outputs into reasoning paths, builds reference paths from external biomedical KGs, and classifies errors into relation/branch/missing types. Central claims rest on case studies and expert evaluation demonstrating utility for identifying implausible reasoning. No mathematical derivations, equations, fitted parameters, or predictions appear in the text. No self-citations serve as load-bearing uniqueness theorems, and no ansatzes or renamings reduce the result to its inputs by construction. The KG reference paths and expert judgments function as external benchmarks, keeping the evaluation self-contained and non-circular.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: biomedical knowledge graphs provide reliable reference paths for correct diagnostic reasoning.
invented entities (1)
- Relation errors, branch errors, and missing errors (no independent evidence)
Reference graph
Works this paper leans on
- [1] B. Abu-Salih, M. Al-Qurishi, M. Alweshah, M. Al-Smadi, R. Alfayez, and H. Saadeh. Healthcare knowledge graph construction: A systematic review of the state-of-the-art, open issues, and opportunities. Journal of Big Data, 10(1):81, 2023.
- [2] M. Agrawal, S. Hegselmann, H. Lang, Y. Kim, and D. Sontag. Large language models are few-shot clinical information extractors. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 1998–2022, 2022.
- [3] H. S. Al Khatib, S. Neupane, H. Kumar Manchukonda, N. A. Golilarz, S. Mittal, A. Amirlatifi, and S. Rahimi. Patient-centric knowledge graphs: a survey of current methods, challenges, and applications. Frontiers in Artificial Intelligence, 7:1388479, 2024.
- [4] A. Balachandran. Fine-tuned embedding models for medical/clinical IR. https://huggingface.co/blog/abhinand/medembed-finetuned-embedding-models-for-medical-ir, 2024. Hugging Face blog, accessed March 14, 2026.
- [5] D. Bang, S. Lim, S. Lee, and S. Kim. Biomedical knowledge graph learning for drug repurposing by extending guilt-by-association to multiple layers. Nature Communications, 14(1):3570, 2023.
- [6] M. Battogtokh, C. Davidescu, M. Luck, and R. Borgo. Semla: A visual analysis system for fine-grained text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, pp. 23772–23774.
- [7] S. Bedi, Y. Jiang, P. Chung, S. Koyejo, and N. Shah. Fidelity of medical reasoning in large language models. JAMA Network Open, 8(8):e2526021.
- [8] A. M. Brasoveanu, A. Scharl, L. J. Nixon, and R. Andonie. Visualizing large language models: A brief survey. In 2024 28th International Conference Information Visualisation (IV), pp. 236–245. IEEE, 2024.
- [9] X. Chen, J. Xiang, S. Lu, Y. Liu, M. He, and D. Shi. Evaluating large language models and agents in healthcare: key challenges in clinical applications. Intelligent Medicine, 5(02):151–163, 2025.
- [11] H. Cui, J. Lu, R. Xu, S. Wang, W. Ma, Y. Yu, S. Yu, X. Kan, C. Ling, L. Zhao, et al. A review on knowledge graphs for healthcare: Resources, applications, and promises. Journal of Biomedical Informatics, p. 104861.
- [12] Y. Gao, R. Li, E. Croxford, J. Caskey, B. W. Patterson, M. Churpek, T. Miller, D. Dligach, and M. Afshar. Leveraging medical knowledge graphs into large language models for diagnosis prediction: Design and application study. JMIR AI, 4:e58670, 2025.
- [13] A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864, 2016.
- [14] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, pp. 1321–1330. PMLR, 2017.
- [15] C. T. Hoyt, D. Domingo-Fernández, R. Aldisi, L. Xu, K. Kolpeja, S. Spalek, E. Wollert, J. Bachman, B. M. Gyori, P. Greene, et al. Re-curation and rational enrichment of knowledge graphs in biological expression language. Database, 2019:baz068, 2019.
- [16] Y. Hu, Q. Chen, J. Du, X. Peng, V. K. Keloth, X. Zuo, Y. Zhou, Z. Li, X. Jiang, Z. Lu, et al. Improving large language models for clinical named entity recognition via prompt engineering. Journal of the American Medical Informatics Association, 31(9):1812–1820, 2024.
- [17] X. Huang, J. Zhang, Z. Xu, L. Ou, and J. Tong. A knowledge graph based question answering method for medical domain. PeerJ Computer Science, 7:e667, 2021.
- [18] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023.
- [19] M. Jia, J. Duan, Y. Song, and J. Wang. Medikal: Integrating knowledge graphs as assistants of LLMs for enhanced clinical diagnosis on EMRs. In Proceedings of the 31st International Conference on Computational Linguistics, pp. 9278–9298, 2025.
- [20] D. Jin, E. Pan, N. Oufattole, W.-H. Weng, H. Fang, and P. Szolovits. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421.
- [21] Q. Jin, B. Dhingra, Z. Liu, W. Cohen, and X. Lu. PubMedQA: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2567–2577, 2019.
- [22] S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z. Hatfield-Dodds, N. DasSarma, E. Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022.
- [23] M. Khodadad, A. S. Kasmaee, M. Astaraki, and H. Mahyar. Towards domain specification of embedding models in medicine. arXiv preprint arXiv:2507.19407, 2025.
- [24] Y. Kim, J. Wu, Y. Abdulle, and H. Wu. MedExQA: Medical question answering benchmark with multiple explanations. In Proceedings of the 23rd Workshop on Biomedical Natural Language Processing, pp. 167–181.
- [25] A. Lacerda, G. Pappa, A. C. M. Pereira, W. Meira Jr, and A. G. de Almeida Barros. Evaluation of medical large language models: taxonomy, review, and directions. In Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, pp. 10528–10536, 2025.
- [26] T. Lanham, A. Chen, A. Radhakrishnan, B. Steiner, C. Denison, D. Hernandez, D. Li, E. Durmus, E. Hubinger, J. Kernion, et al. Measuring faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702.
- [27] H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let's verify step by step. In The Twelfth International Conference on Learning Representations, 2023.
- [28] M. Liu, W. Hu, J. Ding, J. Xu, X. Li, L. Zhu, Z. Bai, X. Shi, B. Wang, H. Song, et al. MedBench: A comprehensive, standardized, and reliable benchmarking system for evaluating Chinese medical large language models. Big Data Mining and Analytics, 7(4):1116–1128, 2024.
- [30] K. Lyu, Y. Tian, Y. Shang, T. Zhou, Z. Yang, Q. Liu, X. Yao, P. Zhang, J. Chen, and J. Li. Causal knowledge graph construction and evaluation for clinical decision support of diabetic nephropathy. Journal of Biomedical Informatics, 139:104298, 2023.
- [31] L. G. McCoy, R. Swamy, N. Sagar, M. Wang, S. Bacchi, J. M. N. Fong, N. C. Tan, K. Tan, T. A. Buckley, P. Brodeur, et al. Assessment of large language models in clinical reasoning: a novel benchmarking study. NEJM AI, 2(10):AIdbp2500120, 2025.
- [32] S. Mohseni, N. Zarei, and E. D. Ragan. A multidisciplinary survey and framework for design and evaluation of explainable AI systems. ACM Transactions on Interactive Intelligent Systems (TiiS), 11(3-4):1–45, 2021.
- [34] OpenAI. GPT-5 mini model. https://developers.openai.com/api/docs/models/gpt-5-mini, 2025. OpenAI API documentation, accessed March 14, 2026.
- [35] OpenAI. Structured model outputs. https://developers.openai.com/api/docs/guides/structured-outputs/, 2025. OpenAI API documentation, accessed March 14, 2026.
- [36] OpenAI. Introducing OpenAI for Healthcare. https://openai.com/index/openai-for-healthcare/, 2026. OpenAI webpage, January 8, 2026, accessed March 14, 2026.
- [37] OpenAI. Models. https://developers.openai.com/api/docs/models, 2026. OpenAI API documentation, accessed March 14, 2026.
- [38] A. Pal, L. K. Umapathi, and M. Sankarasubbu. MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on Health, Inference, and Learning, pp. 248–260. PMLR.
- [40] P. Qiu, C. Wu, S. Liu, Y. Fan, W. Zhao, Z. Chen, H. Gu, C. Peng, Y. Zhang, Y. Wang, et al. Quantifying the reasoning abilities of LLMs on clinical cases. Nature Communications, 16(1):9799, 2025.
- [41] N. Rajani, W. Liang, L. Chen, M. Mitchell, and J. Zou. Seal: Interactive tool for systematic error analysis and labeling. In Proceedings...
- [42] M. T. Ribeiro, T. Wu, C. Guestrin, and S. Singh. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4902–4912, 2020.
- [43] G. Sarti, N. Feldhus, L. Sickert, and O. Van Der Wal. Inseq: An interpretability toolkit for sequence generation models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pp. 421–435, 2023.
- [44] R. Sheng. A survey of LLM-based multi-agent systems in medicine. Authorea Preprints, 2025.
- [46] K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023.
- [47] K. Singhal, T. Tu, J. Gottweis, R. Sayres, E. Wulczyn, M. Amin, L. Hou, K. Clark, S. R. Pfohl, H. Cole-Lewis, et al. Toward expert-level medical question answering with large language models. Nature Medicine, 31(3):943–950, 2025.
- [48] S. Sivarajkumar, M. Kelley, A. Samolyk-Mazzanti, S. Visweswaran, and Y. Wang. An empirical evaluation of prompting strategies for large language models in zero-shot clinical natural language processing: algorithm development and validation study. JMIR Medical Informatics, 12:e55318.
- [49] K. Soman, P. W. Rose, J. H. Morris, R. E. Akbas, B. Smith, B. Peetoom, C. Villouta-Reyes, G. Cerono, Y. Shi, A. Rizk-Jackson, et al. Biomedical knowledge graph-optimized prompt generation for large language models. Bioinformatics, 40(9):btae560, 2024.
- [50] T. Spinner, U. Schlegel, H. Schäfer, and M. El-Assady. explAIner: A visual analytics framework for interactive and explainable machine learning. IEEE Transactions on Visualization and Computer Graphics, 26(1):1064–1074, 2019.
- [51] H. Strobelt, S. Gehrmann, M. Behrisch, A. Perer, H. Pfister, and A. M. Rush. Seq2Seq-Vis: A visual debugging tool for sequence-to-sequence models. IEEE Transactions on Visualization and Computer Graphics, 25(1):353–363, 2018.
- [52] H. Strobelt, S. Gehrmann, H. Pfister, and A. M. Rush. LSTMVis: A tool for visual analysis of hidden state dynamics in recurrent neural networks. IEEE Transactions on Visualization and Computer Graphics, 24(1):667–676.
- [53] C. Su, Y. Hou, M. Zhou, S. Rajendran, J. R. Maasch, Z. Abedi, H. Zhang, Z. Bai, A. Cuturrufo, W. Guo, et al. Biomedical discovery through the integrative biomedical knowledge hub (iBKH). iScience, 26(4), 2023.
- [54] T. Y. C. Tam, S. Sivarajkumar, S. Kapoor, A. V. Stolyar, K. Polanska, K. R. McCarthy, H. Osterhoudt, X. Wu, S. Visweswaran, S. Fu, et al. A framework for human evaluation of large language models in healthcare derived from literature review. npj Digital Medicine, 7(1):258, 2024.
- [55] I. Tenney, J. Wexler, J. Bastings, T. Bolukbasi, A. Coenen, S. Gehrmann, E. Jiang, M. Pushkarna, C. Radebaugh, E. Reif, et al. The language interpretability tool: Extensible, interactive visualizations and analysis for NLP models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 107–118, 2020.
- [56] I. Tufanov, K. Hambardzumyan, J. Ferrando, and E. Voita. LM transparency tool: Interactive tool for analyzing transformer language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pp. 51–60, 2024.
- [57] M. Turpin, J. Michael, E. Perez, and S. Bowman. Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36:74952–74965, 2023.
- [58] L. Van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.
- [59] J. Vig. A multiscale visualization of attention in the transformer model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 37–42, 2019.
- [60] J. Wang, S. Liu, and W. Zhang. Visual analytics for machine learning: A data perspective survey. IEEE Transactions on Visualization and Computer Graphics, 30(12):7637–7656, 2024.
- [62] X. Wang, G. Chen, S. Dingjie, Z. Zhiyi, Z. Chen, Q. Xiao, J. Chen, F. Jiang, J. Li, X. Wan, et al. CMB: A comprehensive medical benchmark in Chinese. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 6184–6205, 2024.
- [63] X. Wang, R. Huang, Z. Jin, T. Fang, and H. Qu. CommonsenseVis: Visualizing and understanding commonsense reasoning capabilities of natural language models. IEEE Transactions on Visualization and Computer Graphics, 30(1):273–283, 2023.
- [65] T. Wu, M. T. Ribeiro, J. Heer, and D. S. Weld. Errudite: Scalable, reproducible, and testable error analysis. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 747–763.
- [66] R. Xu, P. Jiang, L. Luo, C. Xiao, A. Cross, S. Pan, J. Sun, and C. Yang. A survey on unifying large language models and knowledge graphs for biomedicine and healthcare. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, pp. 6195–6205, 2025.
- [67] W. Yang, L. Some, M. Bain, and B. Kang. A comprehensive survey on integrating large language models with knowledge-based methods. Knowledge-Based Systems, 318:113503, 2025.
- [68] C. Yeh, Y. Chen, A. Wu, C. Chen, F. Viégas, and M. Wattenberg. AttentionViz: A global view of transformer attention. IEEE Transactions on Visualization and Computer Graphics, 30(1):262–272, 2023.
- [69] J. Yuan, J. Vig, and N. Rajani. iSEA: An interactive pipeline for semantic error analysis of NLP models. In Proceedings of the 27th International Conference on Intelligent User Interfaces, pp. 878–888, 2022.
- [70] S. Zhou, W. Xie, J. Li, Z. Zhan, M. Song, H. Yang, C. Espinoza, L. Welton, X. Mai, Y. Jin, et al. Automating expert-level medical reasoning evaluation of large language models. npj Digital Medicine, 2025.
discussion (0)