MED-VRAG reaches 78.6% average accuracy on four medical QA benchmarks by iteratively retrieving PMC page images with ColQwen2.5 embeddings and a VLM that refines queries over up to three rounds.
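The retrieve-then-refine loop described above can be sketched in a few lines. Everything below (the toy corpus, the word-overlap retriever, and the query-refinement stand-in for the VLM) is a hypothetical illustration of the loop structure, not the paper's actual pipeline or the ColQwen2.5 retriever.

```python
# Minimal sketch of an iterative retrieve-and-refine loop: a retriever
# scores pages against the query, and a "refiner" (standing in for the
# VLM) expands the query for up to three rounds, stopping early once the
# retrieved set stops changing. Corpus and functions are toy stand-ins.

TOY_CORPUS = {
    "page_1": "aspirin dosage guidelines",
    "page_2": "beta blocker contraindications",
    "page_3": "aspirin interaction with warfarin",
}

def retrieve(query: str, k: int = 2) -> list[str]:
    """Toy retriever: rank pages by word overlap with the query."""
    q = set(query.lower().split())
    scored = sorted(
        TOY_CORPUS,
        key=lambda p: len(q & set(TOY_CORPUS[p].split())),
        reverse=True,
    )
    return scored[:k]

def refine_query(query: str, pages: list[str]) -> str:
    """Toy refinement: append a salient term from each retrieved page."""
    extra = " ".join(TOY_CORPUS[p].split()[-1] for p in pages)
    return f"{query} {extra}"

def iterative_retrieve(query: str, max_rounds: int = 3) -> list[str]:
    pages: list[str] = []
    for _ in range(max_rounds):
        new_pages = retrieve(query)
        if new_pages == pages:  # converged: same pages as last round
            break
        pages = new_pages
        query = refine_query(query, pages)
    return pages

print(iterative_retrieve("aspirin interactions"))
```

The early-exit on an unchanged retrieval set is one simple way to realize "up to three rounds" without always paying for all three.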
Capabilities of GPT-4 on Medical Challenge Problems
16 Pith papers cite this work.
representative citing papers
Indistinguishability-based privacy is incomparable to extractability in LLMs, and a new (l, b)-inextractability definition with rank-based bounds provides a tighter measure of extraction risk than prior proxies.
CuraView detects sentence-level faithfulness hallucinations in medical discharge summaries via GraphRAG knowledge graphs and multi-agent evidence grading, achieving 0.831 F1 on critical contradictions with a fine-tuned Qwen3-14B model and 50% relative improvement over baselines.
MedSkillAudit is a new domain-specific audit framework for medical research agent skills that achieved moderate agreement with expert reviews (ICC 0.449), exceeding the human inter-rater baseline (ICC 0.300).
HEG-TKG grounds LLM clinical reasoning in hierarchical evidence-based temporal knowledge graphs from 4,512 PubMed records, delivering 100% citation verifiability and error detectability where standard RAG and unprompted LLMs produce none.
MedDialBench shows LLMs suffer 1.7-3.4x larger diagnostic accuracy drops from patients fabricating symptoms than withholding them, with fabrication driving super-additive interaction effects across models.
EvidenceNet releases disease-specific biomedical knowledge bases with 7,872 and 6,622 evidence records for HCC and CRC, plus knowledge graphs, extracted via an LLM pipeline with reported high fidelity.
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
Domain fine-tuning of a 4B LLM yields a statistically significant 6.8 pp accuracy gain on MedQA-USMLE over a general baseline, while RAG over medical explanations produces no significant improvement.
VeriLLMed is an interactive visual debugging tool that maps LLM diagnostic reasoning to knowledge graphs to identify and categorize relation, branch, and missing errors.
EviCare uses deep model-guided evidence to enhance LLM in-context reasoning for accurate diagnosis prediction from EHRs, outperforming baselines by 20.65% on average and 30.97% for novel diagnoses on MIMIC datasets.
A 14B model trained on synthetic data from Brazilian clinical guidelines outperforms larger LLMs on new benchmarks for Brazilian healthcare protocols.
The paper introduces a dual-layer AI identification framework that integrates cryptographic, blockchain, and zero-knowledge techniques with governance checkpoints to support lifecycle accountability in digital enterprises.
Dense retrieval plus query reformulation and reranking reaches 60.49% accuracy on MedQA USMLE, outperforming other setups while domain-specialized models make better use of the retrieved evidence.
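A two-stage setup like the one summarized above can be sketched as dense retrieval for recall followed by a reranker for precision. The embeddings, documents, and both scorers below are illustrative stand-ins under that assumption, not the evaluated system.

```python
# Toy two-stage pipeline: stage 1 does dense retrieval (cosine over
# hypothetical embeddings), stage 2 reranks the candidates with a
# stand-in cross-style scorer based on exact term overlap.
import math

DOCS = {
    "d1": [0.9, 0.1],  # hypothetical dense embeddings
    "d2": [0.2, 0.8],
    "d3": [0.7, 0.3],
}
TEXT = {
    "d1": "first-line therapy for hypertension",
    "d2": "pediatric asthma management",
    "d3": "hypertension in pregnancy first-line",
}

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def dense_retrieve(q_emb: list[float], k: int = 2) -> list[str]:
    """Stage 1: recall-oriented top-k by embedding similarity."""
    return sorted(DOCS, key=lambda d: cosine(q_emb, DOCS[d]), reverse=True)[:k]

def rerank(query_terms: list[str], cands: list[str]) -> list[str]:
    """Stage 2: precision-oriented reorder by term overlap with doc text."""
    return sorted(
        cands,
        key=lambda d: len(set(query_terms) & set(TEXT[d].split())),
        reverse=True,
    )

cands = dense_retrieve([0.8, 0.2], k=2)
ranked = rerank(["hypertension", "pregnancy"], cands)
print(ranked)
```

The point of the split is that the cheap dense stage keeps the candidate set small, so the more expensive reranker (here trivially cheap) only scores a handful of documents.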
Domain-specific models like ChatDoctor excel at medically accurate and contextually reliable text while general-purpose models like Grok and LLaMA perform better on structured medical question-answering tasks.