Recognition: unknown
VeriLLMed: Interactive Visual Debugging of Medical Large Language Models with Knowledge Graphs
Pith reviewed 2026-05-08 08:00 UTC · model grok-4.3
The pith
VeriLLMed converts medical LLM outputs into reasoning paths and compares them against biomedical knowledge graph references to expose three classes of diagnostic errors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VeriLLMed transforms model outputs into comparable reasoning paths, constructs knowledge graph-grounded reference paths, and identifies three recurring classes of diagnosis errors: relation errors, branch errors, and missing errors. Case studies and expert evaluation demonstrate that VeriLLMed helps developers identify clinically implausible reasoning and generate actionable insights that can inform the improvement of medical LLMs.
What carries the argument
The VeriLLMed visual analytics system that converts LLM diagnostic outputs into reasoning paths and aligns them with biomedical knowledge graph reference paths to classify errors.
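The paper's own implementation is not reproduced here; the following is a minimal sketch, assuming reasoning steps on both sides have already been reduced to (source, relation, target) triples, of how the three error classes could be detected. The triple representation, function names, and exact matching rules are illustrative assumptions, not the authors' definitions.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    source: str
    relation: str
    target: str

def classify_errors(model_path: list[Triple], reference_path: list[Triple]) -> dict[str, list[Triple]]:
    """Compare an LLM reasoning path against a KG-grounded reference path.

    relation error: same entity pair, but the model asserts a different relation
    branch error:   a model step whose entity pair is absent from the reference
    missing error:  a reference step the model never covers
    """
    ref_by_pair = {(t.source, t.target): t.relation for t in reference_path}
    model_pairs = {(t.source, t.target) for t in model_path}
    errors: dict[str, list[Triple]] = {"relation": [], "branch": [], "missing": []}

    for t in model_path:
        pair = (t.source, t.target)
        if pair in ref_by_pair:
            if t.relation != ref_by_pair[pair]:
                errors["relation"].append(t)   # wrong relation between known entities
        else:
            errors["branch"].append(t)         # step not supported by the reference
    for t in reference_path:
        if (t.source, t.target) not in model_pairs:
            errors["missing"].append(t)        # expected step the model skipped
    return errors

# Hypothetical toy case
model = [Triple("fever", "caused_by", "influenza"),
         Triple("influenza", "treated_with", "antibiotics")]
reference = [Triple("fever", "symptom_of", "influenza"),
             Triple("influenza", "treated_with", "oseltamivir")]
print(classify_errors(model, reference))
```

Matching on exact entity pairs is the simplest possible alignment; allowing multi-hop or fuzzy alignment would shift which steps count as branch versus missing errors, and the paper's own alignment is presumably richer than this.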
If this is right
- Developers gain a way to move from isolated manual reviews to pattern-based inspection across many cases.
- Three specific error categories become visible targets for model retraining or prompt engineering.
- Insights from the system can directly guide iterative improvements to medical LLMs before they reach clinical use.
- Prioritization of which failures to fix first becomes possible when recurring patterns are quantified.
Where Pith is reading between the lines
- The same path-comparison approach could be adapted to debug LLMs in other high-stakes fields such as legal reasoning or financial forecasting.
- Keeping the reference knowledge graphs current would require ongoing maintenance pipelines as medical knowledge evolves.
- The three error classes might serve as labels for new training objectives that penalize relation, branch, and missing mistakes explicitly.
Load-bearing premise
The selected biomedical knowledge graphs supply accurate, complete, and current reference paths that correctly represent proper diagnostic reasoning in the tested cases.
What would settle it
Have an independent panel of clinicians review the same model outputs: if the errors flagged by VeriLLMed do not correspond to genuinely implausible clinical reasoning, or if the reference paths themselves contain inaccuracies, the central claim fails.
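If such a panel study were run, one simple way to quantify the outcome is agreement between VeriLLMed's flags and the panel's judgments, for example Cohen's kappa over per-step implausibility labels. The data layout below is a hypothetical illustration; only the kappa formula itself is standard.

```python
from collections import Counter

def cohen_kappa(labels_a: list[int], labels_b: list[int]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if the two raters labeled independently.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(labels_a) | set(labels_b)) / n**2
    if expected == 1:
        return 1.0  # degenerate case: both raters always give the same single label
    return (observed - expected) / (1 - expected)

# Hypothetical per-step labels: 1 = "clinically implausible", 0 = "plausible".
verillmed_flags = [1, 0, 1, 1, 0, 0, 1, 0]
clinician_panel = [1, 0, 0, 1, 0, 0, 1, 1]
print(f"kappa = {cohen_kappa(verillmed_flags, clinician_panel):.2f}")  # 0.50
```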
Figures
read the original abstract
Large language models (LLMs) show promise in medical diagnosis, but real-world deployment remains challenging due to high-stakes clinical decisions and imperfect reasoning reliability. As a result, careful inspection of model behavior is essential for assessing whether diagnostic reasoning is reliable and clinically grounded. However, debugging medical LLMs remains difficult. First, developers often lack sufficient medical domain expertise to interpret model errors in clinically meaningful terms. Second, models can fail across a large and diverse set of instances involving different input types, tasks, and reasoning steps, making it challenging for developers to prioritize which errors deserve focused inspection. Third, developers struggle to identify recurring error patterns across cases, as existing debugging practices are largely instance-centric and rely on manual inspection of isolated failures. To address these challenges, we present VeriLLMed, a visual analytics system that integrates external biomedical knowledge to audit and debug medical LLM diagnostic reasoning. VeriLLMed transforms model outputs into comparable reasoning paths, constructs knowledge graph-grounded reference paths, and identifies three recurring classes of diagnosis errors: relation errors, branch errors, and missing errors. Case studies and expert evaluation demonstrate that VeriLLMed helps developers identify clinically implausible reasoning and generate actionable insights that can inform the improvement of medical LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents VeriLLMed, a visual analytics system that integrates biomedical knowledge graphs to audit medical LLM diagnostic reasoning. It transforms model outputs into comparable reasoning paths, constructs KG-grounded reference paths, classifies errors into relation errors, branch errors, and missing errors, and uses interactive visualizations to help developers identify recurring patterns. Case studies and expert evaluation are offered as evidence that the system enables identification of clinically implausible reasoning and generation of actionable insights for LLM improvement.
Significance. If the KG reference paths reliably proxy correct clinical reasoning, VeriLLMed could meaningfully address the three stated challenges (domain expertise gaps, error prioritization across instances, and pattern detection) by providing a structured, visual debugging workflow for medical LLMs. The error taxonomy and visual interface represent a practical contribution to LLM auditing tools, though the absence of quantitative validation metrics limits the strength of the supporting evidence.
major comments (2)
- [System overview and error classification (sections describing reference path construction and error taxonomy)] The central claim that VeriLLMed identifies 'clinically implausible reasoning' rests on the assumption that the chosen biomedical KGs yield reference paths that accurately and completely encode correct diagnostic reasoning. No quantitative comparison of KG-derived paths against independent clinician-annotated paths, no coverage statistics for the evaluated cases, and no sensitivity analysis for KG incompleteness or outdated relations are provided; this makes the error classifications (relation/branch/missing) difficult to interpret as clinically grounded rather than KG-specific artifacts.
- [Case studies and expert evaluation] The evaluation relies on case studies and expert evaluation to support claims of actionable insights and identification of implausible reasoning, yet reports no quantitative metrics (e.g., inter-rater agreement, task completion times, error detection rates), participant numbers, study design details, or controls. This leaves the strength of evidence for the central usability and insight-generation claims unclear.
minor comments (2)
- [Abstract] The abstract and introduction would benefit from a brief statement of the number of cases, experts, and KGs used in the evaluation to allow readers to gauge scope immediately.
- [Figures and visualizations] Ensure all figures include clear legends, axis labels, and direct references in the text; some visual elements may require additional annotation to distinguish model paths from reference paths at a glance.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important aspects of grounding our error classifications and strengthening the evaluation evidence. We address each major comment point-by-point below, indicating planned revisions where they align with the manuscript's scope. We believe these changes will clarify the claims without overstating the current evidence.
read point-by-point responses
- Referee: [System overview and error classification (sections describing reference path construction and error taxonomy)] The central claim that VeriLLMed identifies 'clinically implausible reasoning' rests on the assumption that the chosen biomedical KGs yield reference paths that accurately and completely encode correct diagnostic reasoning. No quantitative comparison of KG-derived paths against independent clinician-annotated paths, no coverage statistics for the evaluated cases, and no sensitivity analysis for KG incompleteness or outdated relations are provided; this makes the error classifications (relation/branch/missing) difficult to interpret as clinically grounded rather than KG-specific artifacts.
Authors: We agree that the fidelity of KG-derived reference paths as proxies for correct clinical reasoning is a foundational assumption, and that the absence of quantitative validation metrics makes it harder to interpret the error classifications as fully clinically grounded. The manuscript (Section 3.2) constructs reference paths from established biomedical KGs and defines the three error classes explicitly as deviations from these paths; the case studies then illustrate how these deviations surface implausible steps. The expert evaluation provides qualitative confirmation that experts viewed the flagged errors as clinically relevant. However, we did not include coverage statistics, sensitivity analysis, or direct comparisons to independent clinician annotations in the original submission. In the revised version, we will add a new limitations subsection in the Discussion that reports coverage statistics for the evaluated cases (derived from the KG; one possible coverage computation is sketched after these responses) and discusses potential impacts of KG incompleteness. We will also explicitly scope the claims to 'KG-grounded' rather than claiming absolute clinical correctness. A full quantitative clinician-annotation study lies beyond the current work's resources and is noted as future work. revision: partial
- Referee: [Case studies and expert evaluation] The evaluation relies on case studies and expert evaluation to support claims of actionable insights and identification of implausible reasoning, yet reports no quantitative metrics (e.g., inter-rater agreement, task completion times, error detection rates), participant numbers, study design details, or controls. This leaves the strength of evidence for the central usability and insight-generation claims unclear.
Authors: We acknowledge that the Evaluation section would benefit from greater transparency on study details to better support the claims of actionable insights. The manuscript presents two detailed case studies showing how the visual interface reveals recurring error patterns, followed by expert evaluation in which domain experts interacted with the system to audit LLM outputs and generate improvement suggestions. To strengthen this, we will revise the section to specify the number of expert participants, the semi-structured interview protocol used, and any observed consistency in the insights generated. While the original study did not collect quantitative metrics such as inter-rater agreement or task completion times (focusing instead on qualitative demonstration of the workflow), we will add an explicit statement of this scope and note that controlled quantitative usability testing remains valuable future work. These additions will make the evidence base clearer without changing the qualitative nature of the reported findings. revision: partial
- A full quantitative validation comparing KG-derived paths against independent clinician-annotated reference paths would require new data collection and annotation effort that is not feasible within the timeline or scope of this revision.
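As context for the coverage statistics promised in the first response, below is a minimal sketch of one way such numbers could be computed, assuming each case has been reduced to a set of normalized entity mentions and the KG exposes its node vocabulary. All names and the toy data are hypothetical, not the paper's pipeline.

```python
def kg_coverage(cases: dict[str, set[str]], kg_entities: set[str]) -> dict[str, float]:
    """Fraction of each case's clinical entity mentions that the reference KG can ground."""
    return {
        case_id: (len(mentions & kg_entities) / len(mentions)) if mentions else 0.0
        for case_id, mentions in cases.items()
    }

# Hypothetical toy data: normalized entity mentions per evaluated case.
cases = {
    "case_01": {"chest pain", "troponin", "myocardial infarction"},
    "case_02": {"polyuria", "hba1c", "type 2 diabetes", "rare_syndrome_x"},
}
kg_entities = {"chest pain", "troponin", "myocardial infarction",
               "polyuria", "hba1c", "type 2 diabetes"}

coverage = kg_coverage(cases, kg_entities)
for case_id, frac in coverage.items():
    print(f"{case_id}: {frac:.0%} of mentions grounded in the KG")
print(f"mean coverage: {sum(coverage.values()) / len(coverage):.0%}")
```

Per-case coverage below some threshold would also give a principled way to separate cases where missing-error counts mostly reflect KG gaps from cases where they reflect model mistakes.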
Circularity Check
No significant circularity; system design validated externally
full rationale
The paper describes a visual analytics system (VeriLLMed) that transforms LLM outputs into reasoning paths, builds reference paths from external biomedical KGs, and classifies errors into relation/branch/missing types. Central claims rest on case studies and expert evaluation demonstrating utility for identifying implausible reasoning. No mathematical derivations, equations, fitted parameters, or predictions appear in the text. No self-citations serve as load-bearing uniqueness theorems, and no ansatzes or renamings reduce the result to its inputs by construction. The KG reference paths and expert judgments function as external benchmarks, keeping the evaluation self-contained and non-circular.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: biomedical knowledge graphs provide reliable reference paths for correct diagnostic reasoning.
invented entities (1)
- Relation errors, branch errors, and missing errors (no independent evidence)
Reference graph
Works this paper leans on
- [1] B. Abu-Salih, M. Al-Qurishi, M. Alweshah, M. Al-Smadi, R. Alfayez, and H. Saadeh. Healthcare knowledge graph construction: A systematic review of the state-of-the-art, open issues, and opportunities. Journal of Big Data, 10(1):81, 2023.
- [2] M. Agrawal, S. Hegselmann, H. Lang, Y. Kim, and D. Sontag. Large language models are few-shot clinical information extractors. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 1998–2022, 2022.
- [3] H. S. Al Khatib, S. Neupane, H. Kumar Manchukonda, N. A. Golilarz, S. Mittal, A. Amirlatifi, and S. Rahimi. Patient-centric knowledge graphs: a survey of current methods, challenges, and applications. Frontiers in Artificial Intelligence, 7:1388479, 2024.
- [4] A. Balachandran. Fine-tuned embedding models for medical/clinical IR. https://huggingface.co/blog/abhinand/medembed-finetuned-embedding-models-for-medical-ir, 2024. Hugging Face blog, accessed March 14, 2026.
- [5] D. Bang, S. Lim, S. Lee, and S. Kim. Biomedical knowledge graph learning for drug repurposing by extending guilt-by-association to multiple layers. Nature Communications, 14(1):3570, 2023.
- [6] M. Battogtokh, C. Davidescu, M. Luck, and R. Borgo. Semla: A visual analysis system for fine-grained text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, pp. 23772–23774.
- [7] S. Bedi, Y. Jiang, P. Chung, S. Koyejo, and N. Shah. Fidelity of medical reasoning in large language models. JAMA Network Open, 8(8):e2526021.
- [8] A. M. Brasoveanu, A. Scharl, L. J. Nixon, and R. Andonie. Visualizing large language models: A brief survey. In 2024 28th International Conference Information Visualisation (IV), pp. 236–245. IEEE, 2024.
- [9] X. Chen, J. Xiang, S. Lu, Y. Liu, M. He, and D. Shi. Evaluating large language models and agents in healthcare: key challenges in clinical applications. Intelligent Medicine, 5(02):151–163, 2025.
- [11] H. Cui, J. Lu, R. Xu, S. Wang, W. Ma, Y. Yu, S. Yu, X. Kan, C. Ling, L. Zhao, et al. A review on knowledge graphs for healthcare: Resources, applications, and promises. Journal of Biomedical Informatics, p. 104861.
- [12] Y. Gao, R. Li, E. Croxford, J. Caskey, B. W. Patterson, M. Churpek, T. Miller, D. Dligach, and M. Afshar. Leveraging medical knowledge graphs into large language models for diagnosis prediction: Design and application study. JMIR AI, 4:e58670, 2025.
- [13] A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864, 2016.
- [14] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, pp. 1321–1330. PMLR, 2017.
- [15] C. T. Hoyt, D. Domingo-Fernández, R. Aldisi, L. Xu, K. Kolpeja, S. Spalek, E. Wollert, J. Bachman, B. M. Gyori, P. Greene, et al. Re-curation and rational enrichment of knowledge graphs in biological expression language. Database, 2019:baz068, 2019.
- [16] Y. Hu, Q. Chen, J. Du, X. Peng, V. K. Keloth, X. Zuo, Y. Zhou, Z. Li, X. Jiang, Z. Lu, et al. Improving large language models for clinical named entity recognition via prompt engineering. Journal of the American Medical Informatics Association, 31(9):1812–1820, 2024.
- [17] X. Huang, J. Zhang, Z. Xu, L. Ou, and J. Tong. A knowledge graph based question answering method for medical domain. PeerJ Computer Science, 7:e667, 2021.
- [18] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023.
- [19] M. Jia, J. Duan, Y. Song, and J. Wang. Medikal: Integrating knowledge graphs as assistants of LLMs for enhanced clinical diagnosis on EMRs. In Proceedings of the 31st International Conference on Computational Linguistics, pp. 9278–9298, 2025.
- [20] D. Jin, E. Pan, N. Oufattole, W.-H. Weng, H. Fang, and P. Szolovits. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421.
- [21] Q. Jin, B. Dhingra, Z. Liu, W. Cohen, and X. Lu. PubMedQA: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2567–2577, 2019.
- [22] S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z. Hatfield-Dodds, N. DasSarma, E. Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022.
- [23] M. Khodadad, A. S. Kasmaee, M. Astaraki, and H. Mahyar. Towards domain specification of embedding models in medicine. arXiv preprint arXiv:2507.19407, 2025.
- [24] Y. Kim, J. Wu, Y. Abdulle, and H. Wu. MedExQA: Medical question answering benchmark with multiple explanations. In Proceedings of the 23rd Workshop on Biomedical Natural Language Processing, pp. 167–181.
- [25] A. Lacerda, G. Pappa, A. C. M. Pereira, W. Meira Jr, and A. G. de Almeida Barros. Evaluation of medical large language models: taxonomy, review, and directions. In Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, pp. 10528–10536, 2025.
- [26] T. Lanham, A. Chen, A. Radhakrishnan, B. Steiner, C. Denison, D. Hernandez, D. Li, E. Durmus, E. Hubinger, J. Kernion, et al. Measuring faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702.
- [27] H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let's verify step by step. In The Twelfth International Conference on Learning Representations, 2023.
- [28] M. Liu, W. Hu, J. Ding, J. Xu, X. Li, L. Zhu, Z. Bai, X. Shi, B. Wang, H. Song, et al. MedBench: A comprehensive, standardized, and reliable benchmarking system for evaluating Chinese medical large language models. Big Data Mining and Analytics, 7(4):1116–1128, 2024.
- [30] K. Lyu, Y. Tian, Y. Shang, T. Zhou, Z. Yang, Q. Liu, X. Yao, P. Zhang, J. Chen, and J. Li. Causal knowledge graph construction and evaluation for clinical decision support of diabetic nephropathy. Journal of Biomedical Informatics, 139:104298, 2023.
- [31] L. G. McCoy, R. Swamy, N. Sagar, M. Wang, S. Bacchi, J. M. N. Fong, N. C. Tan, K. Tan, T. A. Buckley, P. Brodeur, et al. Assessment of large language models in clinical reasoning: a novel benchmarking study. NEJM AI, 2(10):AIdbp2500120, 2025.
- [32] S. Mohseni, N. Zarei, and E. D. Ragan. A multidisciplinary survey and framework for design and evaluation of explainable AI systems. ACM Transactions on Interactive Intelligent Systems (TiiS), 11(3-4):1–45, 2021.
- [34] OpenAI. GPT-5 mini model. https://developers.openai.com/api/docs/models/gpt-5-mini, 2025. OpenAI API documentation, accessed March 14, 2026.
- [35] OpenAI. Structured model outputs. https://developers.openai.com/api/docs/guides/structured-outputs/, 2025. OpenAI API documentation, accessed March 14, 2026.
- [36] OpenAI. Introducing OpenAI for Healthcare. https://openai.com/index/openai-for-healthcare/, 2026. OpenAI webpage, January 8, 2026, accessed March 14, 2026.
- [37] OpenAI. Models. https://developers.openai.com/api/docs/models, 2026. OpenAI API documentation, accessed March 14, 2026.
- [38] A. Pal, L. K. Umapathi, and M. Sankarasubbu. MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on Health, Inference, and Learning, pp. 248–260. PMLR.
- [40] P. Qiu, C. Wu, S. Liu, Y. Fan, W. Zhao, Z. Chen, H. Gu, C. Peng, Y. Zhang, Y. Wang, et al. Quantifying the reasoning abilities of LLMs on clinical cases. Nature Communications, 16(1):9799, 2025.
- [41] N. Rajani, W. Liang, L. Chen, M. Mitchell, and J. Zou. Seal: Interactive tool for systematic error analysis and labeling. In Proceedings...
- [42] M. T. Ribeiro, T. Wu, C. Guestrin, and S. Singh. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4902–4912, 2020.
- [43] G. Sarti, N. Feldhus, L. Sickert, and O. Van Der Wal. Inseq: An interpretability toolkit for sequence generation models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pp. 421–435, 2023.
- [44] R. Sheng. A survey of LLM-based multi-agent systems in medicine. Authorea Preprints, 2025.
- [46] K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023.
- [47] K. Singhal, T. Tu, J. Gottweis, R. Sayres, E. Wulczyn, M. Amin, L. Hou, K. Clark, S. R. Pfohl, H. Cole-Lewis, et al. Toward expert-level medical question answering with large language models. Nature Medicine, 31(3):943–950, 2025.
- [48] S. Sivarajkumar, M. Kelley, A. Samolyk-Mazzanti, S. Visweswaran, and Y. Wang. An empirical evaluation of prompting strategies for large language models in zero-shot clinical natural language processing: algorithm development and validation study. JMIR Medical Informatics, 12:e55318.
- [49] K. Soman, P. W. Rose, J. H. Morris, R. E. Akbas, B. Smith, B. Peetoom, C. Villouta-Reyes, G. Cerono, Y. Shi, A. Rizk-Jackson, et al. Biomedical knowledge graph-optimized prompt generation for large language models. Bioinformatics, 40(9):btae560, 2024.
- [50] T. Spinner, U. Schlegel, H. Schäfer, and M. El-Assady. explAIner: A visual analytics framework for interactive and explainable machine learning. IEEE Transactions on Visualization and Computer Graphics, 26(1):1064–1074, 2019.
- [51] H. Strobelt, S. Gehrmann, M. Behrisch, A. Perer, H. Pfister, and A. M. Rush. Seq2Seq-Vis: A visual debugging tool for sequence-to-sequence models. IEEE Transactions on Visualization and Computer Graphics, 25(1):353–363, 2018.
- [52] H. Strobelt, S. Gehrmann, H. Pfister, and A. M. Rush. LSTMVis: A tool for visual analysis of hidden state dynamics in recurrent neural networks. IEEE Transactions on Visualization and Computer Graphics, 24(1):667–676.
- [53] C. Su, Y. Hou, M. Zhou, S. Rajendran, J. R. Maasch, Z. Abedi, H. Zhang, Z. Bai, A. Cuturrufo, W. Guo, et al. Biomedical discovery through the integrative biomedical knowledge hub (iBKH). iScience, 26(4), 2023.
- [54] T. Y. C. Tam, S. Sivarajkumar, S. Kapoor, A. V. Stolyar, K. Polanska, K. R. McCarthy, H. Osterhoudt, X. Wu, S. Visweswaran, S. Fu, et al. A framework for human evaluation of large language models in healthcare derived from literature review. npj Digital Medicine, 7(1):258, 2024.
- [55] I. Tenney, J. Wexler, J. Bastings, T. Bolukbasi, A. Coenen, S. Gehrmann, E. Jiang, M. Pushkarna, C. Radebaugh, E. Reif, et al. The language interpretability tool: Extensible, interactive visualizations and analysis for NLP models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 107–118, 2020.
- [56] I. Tufanov, K. Hambardzumyan, J. Ferrando, and E. Voita. LM transparency tool: Interactive tool for analyzing transformer language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pp. 51–60, 2024.
- [57] M. Turpin, J. Michael, E. Perez, and S. Bowman. Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36:74952–74965, 2023.
- [58] L. Van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.
- [59] J. Vig. A multiscale visualization of attention in the transformer model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 37–42, 2019.
- [60] J. Wang, S. Liu, and W. Zhang. Visual analytics for machine learning: A data perspective survey. IEEE Transactions on Visualization and Computer Graphics, 30(12):7637–7656, 2024.
- [62] X. Wang, G. Chen, S. Dingjie, Z. Zhiyi, Z. Chen, Q. Xiao, J. Chen, F. Jiang, J. Li, X. Wan, et al. CMB: A comprehensive medical benchmark in Chinese. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 6184–6205, 2024.
- [63] X. Wang, R. Huang, Z. Jin, T. Fang, and H. Qu. CommonsenseVis: Visualizing and understanding commonsense reasoning capabilities of natural language models. IEEE Transactions on Visualization and Computer Graphics, 30(1):273–283, 2023.
- [65] T. Wu, M. T. Ribeiro, J. Heer, and D. S. Weld. Errudite: Scalable, reproducible, and testable error analysis. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 747–763.
- [66] R. Xu, P. Jiang, L. Luo, C. Xiao, A. Cross, S. Pan, J. Sun, and C. Yang. A survey on unifying large language models and knowledge graphs for biomedicine and healthcare. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, pp. 6195–6205, 2025.
- [67] W. Yang, L. Some, M. Bain, and B. Kang. A comprehensive survey on integrating large language models with knowledge-based methods. Knowledge-Based Systems, 318:113503, 2025.
- [68] C. Yeh, Y. Chen, A. Wu, C. Chen, F. Viégas, and M. Wattenberg. AttentionViz: A global view of transformer attention. IEEE Transactions on Visualization and Computer Graphics, 30(1):262–272, 2023.
- [69] J. Yuan, J. Vig, and N. Rajani. iSEA: An interactive pipeline for semantic error analysis of NLP models. In Proceedings of the 27th International Conference on Intelligent User Interfaces, pp. 878–888, 2022.
- [70] S. Zhou, W. Xie, J. Li, Z. Zhan, M. Song, H. Yang, C. Espinoza, L. Welton, X. Mai, Y. Jin, et al. Automating expert-level medical reasoning evaluation of large language models. npj Digital Medicine, 2025.
discussion (0)