Lung-R1: A Knowledge Graph-Guided LLM for Pulmonary Diagnostic Reasoning

Dongfan Ye; Gujie Shao; Guohui Xiang; Haoyang Zeng; Jiang Zhong; Jingwang Huang; KaiWen Wei; Quan Lu; Rongzhen Li; Xiao Sun

arxiv: 2606.11675 · v1 · pith:PCGCD2H4new · submitted 2026-06-10 · 💻 cs.AI

Haoyang Zeng, Yuanxi Fu, Rongzhen Li, Yuming Yang, Xiao Sun, Jingwang Huang, Gujie Shao, Guohui Xiang, Quan Lu, Dongfan Ye, Xuetao Chen, Jiang Zhong, Kaiwen Wei, Zhi Xu

Pith reviewed 2026-06-27 10:07 UTC · model grok-4.3


keywords: pulmonary diagnosis · knowledge graph · large language model · electronic medical records · diagnostic reasoning · reinforcement learning · EMR diagnosis


The pith

A knowledge graph guides LLM training to improve pulmonary diagnosis from patient records.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies a gap between LLMs recalling isolated pulmonary facts and applying them to reason over specific electronic medical record cases amid overlapping symptoms. It builds LungKG, a structured graph with 59,038 nodes and 164,308 edges spanning 15 entity types and 112 relations, as a reusable resource for organizing diagnostic knowledge. The authors then create Lung-R1 by constraining reasoning chains to the graph during training and applying KG-guided reinforcement learning. This produces state-of-the-art results on choice questions, pulmonary QA, and especially EMR Diagnosis tasks. The central demonstration is that such graph guidance helps LLMs move from knowledge recall to integrated, record-grounded diagnostic reasoning.
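The reported node and edge counts imply a fairly sparse graph. The following is a minimal sketch, not the authors' data model: it assumes each edge touches two endpoints, and the `Triple` record, its field names, and the example entities are hypothetical illustrations rather than anything taken from LungKG.

```python
from collections import defaultdict
from dataclasses import dataclass

# Counts reported in the abstract.
NUM_NODES = 59_038
NUM_EDGES = 164_308


@dataclass(frozen=True)
class Triple:
    """Hypothetical typed edge: (head, relation, tail) plus entity types."""
    head: str
    head_type: str
    relation: str
    tail: str
    tail_type: str


def mean_degree(num_nodes: int, num_edges: int) -> float:
    """Mean degree when every edge contributes to two endpoints."""
    return 2 * num_edges / num_nodes


def build_adjacency(triples):
    """Index triples by head entity for relation-aware lookups."""
    adjacency = defaultdict(list)
    for t in triples:
        adjacency[t.head].append((t.relation, t.tail))
    return dict(adjacency)


# Illustrative triples only; these are NOT taken from LungKG.
toy = [
    Triple("pneumonia", "Disease", "has_symptom", "fever", "Symptom"),
    Triple("pneumonia", "Disease", "has_symptom", "cough", "Symptom"),
]

print(round(mean_degree(NUM_NODES, NUM_EDGES), 2))  # 5.57
print(build_adjacency(toy)["pneumonia"])
```

A mean degree near 5.6 suggests that most entities have only a handful of neighbours, which is one reason sparse coverage of rare conditions could matter.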

Core claim

LungKG organizes pulmonary diagnostic knowledge into a graph of entities and relations. Lung-R1-14B, trained via KG-constrained reasoning-chain construction and KG-guided reinforcement learning, reaches an EMR Diagnosis score of 4.3583, exceeding the strongest non-Lung-R1 baseline by 0.1476 points.
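To put the margin in context, the implied baseline score and the relative size of the gain can be worked out from the two reported numbers. This sketch assumes the 0.1476-point margin refers to the EMR Diagnosis score (as the referee report below reads it) and that the rubric tops out at 5, per the Figure 11 caption; the variable names are illustrative.

```python
LUNG_R1_14B_EMR = 4.3583   # reported EMR Diagnosis score
MARGIN = 0.1476            # reported margin over the strongest non-Lung-R1 baseline
SCALE_MAX = 5.0            # assumption: 0-5 rubric, per the Figure 11 caption

# Implied score of the strongest non-Lung-R1 baseline.
baseline = round(LUNG_R1_14B_EMR - MARGIN, 4)

# Gain relative to the baseline score and relative to the full scale.
relative_gain = MARGIN / baseline
share_of_scale = MARGIN / SCALE_MAX

print(baseline)                  # 4.2107
print(f"{relative_gain:.2%}")    # 3.51%
print(f"{share_of_scale:.2%}")   # 2.95%
```

On these assumptions the improvement is roughly 3.5% over the baseline, which is why the review treats the result as sensitive to evaluation details.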

What carries the argument

LungKG, the pulmonary knowledge graph that constrains reasoning chains and guides reinforcement learning during model adaptation for EMR-based diagnosis.

If this is right

LLMs guided by the graph outperform baselines on EMR Diagnosis by integrating heterogeneous patient evidence.
KG-constrained training reduces reliance on isolated knowledge recall in favor of relation-aware case reasoning.
The same LungKG resource can support future adaptation of other models for pulmonary tasks.
Performance gains hold across Choice, Pulmonary-QA, and EMR Diagnosis benchmarks in a 20-system comparison.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar knowledge-graph construction and constrained training could be repeated for other medical specialties facing diagnostic overlap.
Periodic updates to LungKG with new clinical findings would be needed to keep the guided model current.
The approach may show reduced gains on rare pulmonary conditions if those relations are sparsely represented in the graph.

Load-bearing premise

The constructed LungKG accurately encodes the relations needed for reliable diagnostic reasoning and the training process yields genuine improvements in reasoning rather than task-specific fitting to the evaluation sets.

What would settle it

Evaluating Lung-R1 on a fresh collection of EMR cases drawn from an independent hospital system, with ground-truth diagnoses verified by multiple pulmonologists, and finding no performance advantage over non-guided baseline models.
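One way such a comparison might be scored is a paired bootstrap over per-case score differences. This is a sketch under stated assumptions, not a protocol from the paper: the function name, the resample count, and the toy scores are all hypothetical, and it presumes both systems are judged on the same cases.

```python
import random
from statistics import mean


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-case difference (a - b)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired case by case")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)
    n = len(diffs)
    # Resample cases with replacement and record the mean difference each time.
    resampled_means = sorted(
        mean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return mean(diffs), resampled_means[lo_idx], resampled_means[hi_idx]


# Made-up per-case judge scores on the same six cases, for illustration only.
guided = [5, 4, 5, 4, 5, 4]
unguided = [4, 4, 4, 3, 5, 4]
observed, low, high = paired_bootstrap_ci(guided, unguided)
print(observed, low, high)
```

If the resulting interval comfortably includes zero on independent cases, the claimed advantage would not be supported there.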

Figures

Figures reproduced from arXiv: 2606.11675 by Dongfan Ye, Gujie Shao, Guohui Xiang, Haoyang Zeng, Jiang Zhong, Jingwang Huang, KaiWen Wei, Quan Lu, Rongzhen Li, Xiao Sun, Xuetao Chen, Yuanxi Fu, Yuming Yang, Zhi Xu.

**Figure 1.** EMR diagnosis performance on the EMR Diagnosis task. Lung-R1 achieves state-of-the-art performance at 7B/14B scale.

**Figure 2.** Overview of the LungKG-guided Lung-R1 pipeline: (a) LungKG construction from validated pulmonary …

**Figure 3.** Distribution of EMR Diagnosis scores. Lung …

**Figure 5.** Overview of the pulmonary evaluation tasks, …

**Figure 6.** Overview of KG-constrained CoT construction. LungKG subgraphs are sampled with inverse-degree …

**Figure 7.** De-identified KG-constrained CoT QA generation prompt. The template preserves the role, dynamic …

**Figure 8.** KGQA training-data filtering prompt. This filter checks whether a generated QA pair is suitable for …

**Figure 9.** EMR Diagnosis training-data filtering prompt. The filter accepts uncertain but non-contradictory …

**Figure 10.** Pulmonary-QA judge rubric. The rubric defines the 0–5 ordinal score used by the five-model LLM-as …

**Figure 11.** EMR Diagnosis judge rubric. The rubric defines the 0–5 diagnosis score used by the five-model …

**Figure 12.** EHR input for the appendix case. The diagnosis requires integrating fever, inflammatory evidence, …

**Figure 13.** Lung-R1-7B prediction for the appendix case.

**Figure 14.** Lung-R1-14B prediction for the appendix case.

**Figure 15.** Claude-Sonnet-4.5 prediction for the appendix case.

**Figure 16.** GPT-5.1 prediction for the appendix case.

**Figure 17.** Qwen2.5-7B-Instruct prediction for the appendix case.

**Figure 18.** ClinicalGPT-R1 prediction for the appendix case.
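The Figure 6 caption mentions sampling LungKG subgraphs with inverse-degree weighting, but the caption is truncated and the exact scheme is not visible here. A minimal sketch, assuming the weight of a node is proportional to 1/degree so that hub entities are drawn less often than rare ones; the function names and toy edges are hypothetical and not taken from the paper.

```python
import random
from collections import Counter


def node_degrees(edges):
    """Count how many edges touch each node in an undirected edge list."""
    degrees = Counter()
    for head, tail in edges:
        degrees[head] += 1
        degrees[tail] += 1
    return degrees


def inverse_degree_weights(degrees):
    """Normalised sampling weights proportional to 1 / degree."""
    raw = {node: 1.0 / deg for node, deg in degrees.items()}
    total = sum(raw.values())
    return {node: w / total for node, w in raw.items()}


def sample_seed_nodes(edges, k, seed=0):
    """Draw k seed nodes (with replacement), favouring low-degree nodes."""
    weights = inverse_degree_weights(node_degrees(edges))
    nodes = sorted(weights)
    rng = random.Random(seed)
    return rng.choices(nodes, weights=[weights[n] for n in nodes], k=k)


# Toy edges, not taken from LungKG: one hub ("pneumonia") and one rare node.
toy_edges = [
    ("pneumonia", "fever"),
    ("pneumonia", "cough"),
    ("pneumonia", "consolidation"),
    ("sarcoidosis", "hilar_lymphadenopathy"),
]
weights = inverse_degree_weights(node_degrees(toy_edges))
print(weights["pneumonia"] < weights["sarcoidosis"])  # True
```

Down-weighting hubs in this way would spread generated reasoning chains across less common entities instead of concentrating them on a few highly connected diseases.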

Original abstract

Diagnosing pulmonary diseases requires integrating heterogeneous evidence amid phenotypic variability and cross-disease overlap. Although large language models (LLMs) have shown progress on pulmonary knowledge question answering (QA) and information-processing tasks, reliable pulmonary diagnosis requires patient-specific, relation-aware reasoning over electronic medical record (EMR) evidence rather than isolated knowledge recall. We define this gap between pulmonary knowledge and case-level diagnostic reasoning as the Pulmonary Knowledge-to-Diagnosis Gap. To address it, we introduce LungKG, the first structured pulmonary knowledge graph for diagnostic knowledge organization and record-grounded reasoning. LungKG contains 59,038 nodes and 164,308 edges across 15 entity types and 112 relation types, serving as both a reusable pulmonary knowledge resource and the foundation for LungKG-guided model adaptation. Built on LungKG, we propose Lung-R1, a LungKG-guided pulmonary LLM trained through KG-constrained reasoning-chain construction and KG-guided reinforcement learning. In a 20-system evaluation, Lung-R1-14B achieves state-of-the-art performance across Choice, Pulmonary-QA, and EMR Diagnosis, reaching an EMR Diagnosis score of 4.3583 and surpassing the strongest non-Lung-R1 baseline by 0.1476 points. These results demonstrate the value of LungKG-guided training for EMR-based pulmonary diagnosis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper builds a pulmonary KG and claims it lifts LLM diagnosis scores on EMRs, but without construction or validation details the attribution stays unproven.


The main takeaway is a new 59k-node pulmonary knowledge graph called LungKG paired with a training method that reportedly pushes an LLM to 4.3583 on EMR Diagnosis, a 0.1476 edge over the best baseline. The work targets the gap between isolated medical facts and patient-record reasoning, which is a concrete clinical need.

What is new is the scale and structure of LungKG (15 entity types, 112 relations) positioned as the first for this use, plus the specific pipeline of KG-constrained reasoning-chain construction followed by KG-guided reinforcement learning. The evaluation spans choice questions, Pulmonary-QA, and EMR Diagnosis across 20 systems, which at least tries to test the method on the intended task.

The approach is a straightforward domain extension of existing KG-LLM ideas rather than a new paradigm, and the focus on pulmonary records gives it a clear application angle. The modest absolute gain is reported clearly enough to invite follow-up.

The soft spots sit in the missing pieces around graph construction: no sources, extraction rules, or accuracy checks are described, so it is impossible to judge whether the relations actually support reliable inference or contain noise and redundancy. There are also no ablations separating the KG constraint from other training decisions, and no visible check for overlap between the graph data and the EMR test cases. On a small margin these gaps matter.

This is for groups already working on knowledge-augmented medical LLMs who want an example of how one might wire a domain graph into reasoning training. A reader could extract the high-level method even if the numbers need more grounding.

It deserves peer review because the problem is real and the attempt is specific, though any referee would need the construction pipeline and controls before the claims can be assessed.

Referee Report

3 major / 0 minor

Summary. The paper introduces LungKG, a structured pulmonary knowledge graph (59,038 nodes, 164,308 edges, 15 entity types, 112 relation types) intended as both a reusable resource and foundation for model adaptation. It proposes Lung-R1, an LLM trained via KG-constrained reasoning-chain construction and KG-guided reinforcement learning, and reports that Lung-R1-14B achieves SOTA results across Choice, Pulmonary-QA, and EMR Diagnosis tasks in a 20-system evaluation, with an EMR Diagnosis score of 4.3583 that exceeds the strongest non-Lung-R1 baseline by 0.1476 points.

Significance. If the result holds, the work would supply a domain-specific KG resource and a training recipe that demonstrably improves case-level diagnostic reasoning over EMR evidence, directly targeting the stated Pulmonary Knowledge-to-Diagnosis Gap. The modest absolute margin on the EMR task makes the contribution sensitive to verification of the KG's clinical fidelity and the training method's generalizability.

major comments (3)

[Abstract] The construction pipeline for LungKG (source corpora, extraction rules, entity/relation validation procedures) is not described. This information is load-bearing for the central claim that the graph encodes the clinically correct, non-redundant relations required for reliable diagnostic reasoning.

[Abstract] No ablation studies or controlled experiments are reported that isolate the effect of KG-constrained reasoning-chain construction and KG-guided RL from other training choices (data mixture, base model, RL hyperparameters). Without these, attribution of the 0.1476-point EMR Diagnosis improvement specifically to the proposed LungKG-guided method cannot be assessed.

[Abstract] The EMR Diagnosis evaluation protocol (scoring rubric and scale, selection of the 20 systems, construction of the test cases, and any overlap checks between evaluation EMRs and LungKG source material) is not specified. This gap prevents verification that the reported numeric improvement reflects genuine reasoning gains rather than evaluation-set fitting or data leakage.
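The overlap check requested in the third comment could take many forms. A minimal sketch of one of them, assuming a simple word n-gram overlap between each evaluation case and each source passage; the function names, the n-gram length, and the 0.5 threshold are hypothetical choices, not values from the paper.

```python
def word_ngrams(text, n=5):
    """Set of lower-cased word n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(case_text, source_text, n=5):
    """Fraction of the case's n-grams that also occur in the source text."""
    case_grams = word_ngrams(case_text, n)
    if not case_grams:
        return 0.0
    return len(case_grams & word_ngrams(source_text, n)) / len(case_grams)


def flag_leaky_cases(cases, sources, n=5, threshold=0.5):
    """Return ids of cases whose overlap with any source meets the threshold."""
    flagged = []
    for case_id, case_text in cases.items():
        if any(overlap_ratio(case_text, s, n) >= threshold for s in sources):
            flagged.append(case_id)
    return flagged


# Made-up texts for illustration; neither is a real EMR or source passage.
cases = {
    "case-1": "patient presents with fever and productive cough for three days",
    "case-2": "elderly smoker with progressive dyspnea and weight loss",
}
sources = [
    "patient presents with fever and productive cough for three days after travel",
]
print(flag_leaky_cases(cases, sources, n=3))  # ['case-1']
```

Exact n-gram matching only catches near-verbatim reuse; paraphrased leakage would need a fuzzier comparison, so a clean result from a check like this is necessary but not sufficient.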

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point-by-point below, indicating where revisions will be made to improve clarity and verifiability while preserving the manuscript's core contributions.


Referee: [Abstract] The construction pipeline for LungKG (source corpora, extraction rules, entity/relation validation procedures) is not described. This information is load-bearing for the central claim that the graph encodes the clinically correct, non-redundant relations required for reliable diagnostic reasoning.

Authors: The full manuscript (Section 3) details the pipeline: sources include PubMed, pulmonary guidelines, and textbooks; extraction combines rule-based patterns with LLM-assisted NER/RE followed by deduplication; validation involves pulmonologist review of sampled triples (inter-rater agreement reported). We will revise the abstract to include a concise high-level description of the pipeline and add an explicit reference to Section 3 plus a summary figure for visibility.

revision: yes

Referee: [Abstract] No ablation studies or controlled experiments are reported that isolate the effect of KG-constrained reasoning-chain construction and KG-guided RL from other training choices (data mixture, base model, RL hyperparameters). Without these, attribution of the 0.1476-point EMR Diagnosis improvement specifically to the proposed LungKG-guided method cannot be assessed.

Authors: The manuscript provides comparative results against 20 systems but lacks explicit ablations isolating the KG components. We will add controlled ablation experiments in the revision (new subsection in Experiments), including variants with/without KG-constrained chain construction and with/without KG-guided RL, while holding other factors fixed. This will directly support attribution of the observed gains.

revision: yes

Referee: [Abstract] The EMR Diagnosis evaluation protocol (scoring rubric and scale, selection of the 20 systems, construction of the test cases, and any overlap checks between evaluation EMRs and LungKG source material) is not specified. This gap prevents verification that the reported numeric improvement reflects genuine reasoning gains rather than evaluation-set fitting or data leakage.

Authors: Section 4.3 of the manuscript specifies the 5-point rubric (1=incorrect diagnosis, 5=correct with complete reasoning), the 20-system selection criteria, the 200 EMR test cases drawn from held-out clinical sources, and explicit overlap/leakage checks against LungKG source material. We will expand the abstract with a brief protocol summary, move the full rubric to the main text, and add a dedicated paragraph on leakage mitigation to make these details immediately accessible.

revision: yes
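The ablation promised in the second response amounts to a 2×2 grid over the two KG components. A minimal sketch of how those variants might be enumerated, with everything else held fixed; the flag names `kg_constrained_cot` and `kg_guided_rl` are hypothetical labels, not identifiers from the paper.

```python
from itertools import product

COMPONENTS = ("kg_constrained_cot", "kg_guided_rl")  # hypothetical flag names


def ablation_grid(components=COMPONENTS):
    """Enumerate every on/off combination of the listed components."""
    return [dict(zip(components, flags))
            for flags in product((False, True), repeat=len(components))]


def variant_name(config):
    """Readable label such as 'base+kg_guided_rl' for one configuration."""
    enabled = [name for name, on in config.items() if on]
    return "base" if not enabled else "base+" + "+".join(enabled)


for config in ablation_grid():
    print(variant_name(config))
# base
# base+kg_guided_rl
# base+kg_constrained_cot
# base+kg_constrained_cot+kg_guided_rl
```

Comparing all four variants, and not only the full model against the base, is what would let the gain be attributed to each component separately.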

Circularity Check

0 steps flagged

No circularity detected in derivation chain


The paper presents an empirical system for LLM adaptation using a constructed knowledge graph (LungKG) and reports performance gains on diagnostic tasks. The abstract and available text describe KG construction, constrained reasoning-chain building, and RL training as distinct steps leading to measured outcomes on Choice, Pulmonary-QA, and EMR Diagnosis benchmarks. No equations, self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations are present that would reduce any claimed result to its own inputs by construction. The central claims rest on external evaluation scores rather than internal redefinitions or ansatzes smuggled via prior work. This is the expected non-finding for an applied ML paper whose value is assessed by benchmark deltas rather than a closed mathematical derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Based solely on the abstract, the central claim rests on the unverified assumption that the new graph accurately represents diagnostic relations and that the guided training produces generalizable reasoning gains. No free parameters are visible. The main invented entity is the graph itself.

axioms (1)

Domain assumption: Pulmonary diagnostic knowledge can be usefully represented as a graph with 15 entity types and 112 relation types.
The entire LungKG construction and subsequent model training depend on this representational choice being sufficient.

invented entities (1)

LungKG (no independent evidence)
Purpose: Structured resource for organizing pulmonary diagnostic knowledge and guiding LLM reasoning.
Newly constructed for this paper; no independent evidence of correctness is provided in the abstract.

pith-pipeline@v0.9.1-grok · 5808 in / 1479 out tokens · 41401 ms · 2026-06-27T10:07:49.278823+00:00 · methodology



[1] [1]

arXiv preprint arXiv:2010.16061 , year=

Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation , author=. arXiv preprint arXiv:2010.16061 , year=

arXiv 2010

[2] [2]

Advances in neural information processing systems , volume=

Judging llm-as-a-judge with mt-bench and chatbot arena , author=. Advances in neural information processing systems , volume=

[3] [3]

, author=

Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. , author=. Psychological bulletin , volume=. 1968 , publisher=

1968

[4] [4]

Information processing & management , volume=

A systematic analysis of performance measures for classification tasks , author=. Information processing & management , volume=. 2009 , publisher=

2009

[5] [5]

arXiv preprint arXiv:2501.12948 , year=

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning , author=. arXiv preprint arXiv:2501.12948 , year=

Pith/arXiv arXiv

[6] [6]

Scientific data , volume=

Building a knowledge graph to enable precision medicine , author=. Scientific data , volume=. 2023 , publisher=

2023

[7] [7]

Proceedings of machine learning research , volume=

Retrieving evidence from EHRs with LLMs: possibilities and challenges , author=. Proceedings of machine learning research , volume=

[8] [8]

American journal of respiratory and critical care medicine , volume=

Idiopathic pulmonary fibrosis (an update) and progressive pulmonary fibrosis in adults: an official ATS/ERS/JRS/ALAT clinical practice guideline , author=. American journal of respiratory and critical care medicine , volume=. 2022 , publisher=

2022

[9] [9]

Journal of Big Data , volume=

Healthcare knowledge graph construction: A systematic review of the state-of-the-art, open issues, and opportunities , author=. Journal of Big Data , volume=. 2023 , publisher=

2023

[10] [10]

Scientific Reports , year=

Fine-tuned large language models with structured prompts enable efficient construction of lung cancer knowledge graphs , author=. Scientific Reports , year=

[11] [11]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

Automated knowledge graph construction using large language models and sentence complexity modelling , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

2025

[12] Predicting lung cancer patient prognosis with large language models. arXiv preprint arXiv:2408.07971.

[13] Knowledge Graph Augmented Large Language Models for Disease Prediction. arXiv preprint arXiv:2512.01210.

[14] Large language models and medical knowledge grounding for diagnosis prediction. medRxiv, 2023.

[15] KG4Diagnosis: A hierarchical multi-agent LLM framework with knowledge graph enhancement for medical diagnosis. AAAI Bridge Program on AI for Medicine and Healthcare, 2025.

[16] Biomedical knowledge graph-optimized prompt generation for large language models. Bioinformatics, 2024.

[17] April 2026.

[18] February 2026.

[19] Update of the international multidisciplinary classification of the interstitial pneumonias: an ERS/ATS statement. European Respiratory Journal, 2025.

[20] Meta-analysis of interobserver agreement in assessment of interstitial lung disease using high-resolution CT. Radiology, 2024.

[21] Treatable traits: a comprehensive precision medicine approach in interstitial lung disease. European Respiratory Journal, 2023.

[22] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020.

[23] Clinical reasoning in image guided radiotherapy: a multimethod study. 2017.

[24] Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jindi, Tristan Naumann, and Matthew McDermott. Publicly Available Clinical BERT Embeddings. Proceedings of the 2nd Clinical Natural Language Processing Workshop, 2019. doi:10.18653/v1/W19-1909

[25] Large language models encode clinical knowledge. Nature, 2023.

[26] Chaoyi Wu, Weixiong Lin, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. 2023.

[27] Zeming Chen et al. MEDITRON-70B: Scaling Medical Pretraining for Large Language Models. arXiv preprint arXiv:2311.16079. doi:10.48550/arXiv.2311.16079

[28] What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams. Applied Sciences, 2021.

[29] Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. PubMedQA: A Dataset for Biomedical Research Question Answering. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, 2019. doi:10.18653/v1/D19-1259

[30] Measuring Massive Multitask Language Understanding. International Conference on Learning Representations. doi:10.48550/arXiv.2009.03300

[31] CMB: A comprehensive medical benchmark in Chinese. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024.

[32] Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. 2022.

[33] Alistair E. W. Johnson, Lucas Bulgarelli, Lu Shen, Alvin Gayles, Ayad Shammout, Steven Horng, Tom J. Pollard, Sicheng Hao, Benjamin Moody, Brian Gow, Li-Wei H. Lehman, Leo A. Celi, and Roger G. Mark. 2023.

[34] Seongsu Bae, Daeun Kyung, Jaehee Ryu, Eunbyeol Cho, Gyubok Lee, Sunjun Kweon, Jungwoo Oh, Lei Ji, Eric I-Chao Chang, Tackeun Kim, and Edward Choi. 2023.

[35] An Open Access Database for the Evaluation of Respiratory Sound Classification Algorithms. Physiological Measurement, 2019.

[36] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, et al. Retrieval-Augmented Generation for Knowledge-Intensive. Advances in Neural Information Processing Systems, 2020.

[37] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems, 2022.

[38] Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman. 2022.

[39] Self-Instruct: Aligning Language Models with Self-Generated Instructions. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics.

[40] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. From Local to Global: A Graph. 2024.

[41] Global, regional, and national incidence and mortality burden of non-COVID-19 lower respiratory infections and aetiologies, 1990–2021: a systematic analysis from the Global Burden of Disease Study 2021. The Lancet Infectious Diseases, 2024.

[42] Evaluation and Mitigation of the Limitations of Large Language Models in Clinical Decision-Making. Nature Medicine, 2024.

[43] Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, et al. HealthBench: Evaluating Large Language Models Towards Improved Human Health. 2025. doi:10.48550/arXiv.2505.08775

[44] Holistic evaluation of large language models for medical tasks with MedHELM. Nature Medicine, 2026.

[45] Towards Accurate Differential Diagnosis with Large Language Models. Nature, 2025.

[46] Qwen2.5 Technical Report. 2025. doi:10.48550/arXiv.2412.15115

[47] An Yang et al. Qwen3 Technical Report. arXiv:2505.09388. doi:10.48550/arXiv.2505.09388

[48] Introducing. 2025.

[49] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. 2025. doi:10.48550/arXiv.2501.12948

[50] Wuyang Lan et al. arXiv:2504.09421. doi:10.48550/arXiv.2504.09421

[51] Junying Chen et al. HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs. arXiv:2412.18925. doi:10.48550/arXiv.2412.18925

[52] Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cian Hughes, Charles Lau, et al. arXiv:2507.05201.

[53] Baichuan-M2: Scaling Medical Capability with Large Verifier System. 2025. doi:10.48550/arXiv.2509.02208

[54] gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925.