Capabilities of GPT-4 on Medical Challenge Problems

Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, Eric Horvitz · 2023 · cs.CL · arXiv 2303.13375

Canonical reference. 100% of citing Pith papers cite this work as background.

62 Pith papers citing it

Background 100% of classified citations

abstract

Large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation across various domains, including medicine. We present a comprehensive evaluation of GPT-4, a state-of-the-art LLM, on medical competency examinations and benchmark datasets. GPT-4 is a general-purpose model that is not specialized for medical problems through training or engineered to solve clinical tasks. Our analysis covers two sets of official practice materials for the USMLE, a three-step examination program used to assess clinical competency and grant licensure in the United States. We also evaluate performance on the MultiMedQA suite of benchmark datasets. Beyond measuring model performance, experiments were conducted to investigate the influence of test questions containing both text and images on model performance, probe for memorization of content during training, and study probability calibration, which is of critical importance in high-stakes applications like medicine. Our results show that GPT-4, without any specialized prompt crafting, exceeds the passing score on USMLE by over 20 points and outperforms earlier general-purpose models (GPT-3.5) as well as models specifically fine-tuned on medical knowledge (Med-PaLM, a prompt-tuned version of Flan-PaLM 540B). In addition, GPT-4 is significantly better calibrated than GPT-3.5, demonstrating a much-improved ability to predict the likelihood that its answers are correct. We also explore the behavior of the model qualitatively through a case study that shows the ability of GPT-4 to explain medical reasoning, personalize explanations to students, and interactively craft new counterfactual scenarios around a medical case. Implications of the findings are discussed for potential uses of GPT-4 in medical education, assessment, and clinical practice, with appropriate attention to challenges of accuracy and safety.

citation-role summary

background 5

citation-polarity summary

background 5

representative citing papers

EHRNote-ChatQA: A Benchmark for Evidence-Grounded Multi-Turn Clinical Question Answering over Longitudinal Discharge Summaries

cs.CL · 2026-06-14 · unverdicted · novelty 8.0

EHRNote-ChatQA is the first benchmark for evidence-grounded multi-turn clinical QA over longitudinal discharge summaries, containing 16,072 medical-expert-verified pairs across eight categories and revealing LLM weaknesses in evidence grounding and multi-turn consistency.

MedHal-Loc: Are "Explainable-by-Architecture" Medical Hallucination Detectors Faithful Localizers? A Localization Benchmark

cs.CL · 2026-06-19 · unverdicted · novelty 7.0

MedHal-Loc benchmark shows KG-triple hallucination detectors localize errors no better than chance on controlled medical statements due to entity extraction limits, while NLI and consistency methods succeed above chance, and real hallucinations are mostly diffuse conclusion changes.

When LLMs Analyze Scars: From Images to Clinically-Meaningful Features

cs.CV · 2026-06-16 · unverdicted · novelty 7.0

LLMs generate deterministic code to convert scar images into low-dimensional clinical features for classification, claimed to outperform end-to-end deep learning when training data is scarce.

Security and Privacy Prompts in the Wild: What Users Ask LLMs and How LLMs Respond

cs.CL · 2026-06-16 · unverdicted · novelty 7.0

Analysis of 14,727 security and privacy prompts from WildChat finds commercial LLMs give higher-quality responses than open-weight models but can produce inconsistent answers across repeated queries.

SurgiQ: A Large-Scale Multi-Domain Benchmark for Evaluating Surgical Understanding in Large Language Models

cs.CL · 2026-06-06 · unverdicted · novelty 7.0

SurgiQ is a new 13k-question surgical benchmark showing general-purpose LLMs reach 68.1% accuracy while most biomedical models lag and smaller models stay near random baseline.

JMed48k: A Multi-Profession Japanese Medical Licensing Benchmark for Vision-Language Model Evaluation

cs.CV · 2026-05-21 · unverdicted · novelty 7.0 · 2 refs

JMed48k is a new benchmark of Japanese healthcare licensing exams used to evaluate 21 VLMs, with a paired image-removal audit revealing large differences in how models and professions benefit from visual content.

RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

RxEval benchmark shows frontier LLMs reach at most 46.10% exact match on prescription-level medication, dose, and route selection from real patient trajectories.

EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild

cs.AI · 2026-05-10 · conditional · novelty 7.0 · 2 refs

EpiGraph creates a heterogeneous epilepsy knowledge graph that boosts LLM performance on clinical reasoning tasks by 30-41% in pharmacogenomics when used with Graph-RAG.

Iterative Multimodal Retrieval-Augmented Generation for Medical Question Answering

cs.AI · 2026-04-30 · unverdicted · novelty 7.0

MED-VRAG reaches 78.6% average accuracy on four medical QA benchmarks by iteratively retrieving PMC page images with ColQwen2.5 embeddings and a VLM that refines queries over up to three rounds.

Beyond Indistinguishability: Measuring Extraction Risk in LLM APIs

cs.CR · 2026-04-20 · unverdicted · novelty 7.0

Indistinguishability-based privacy is incomparable to extractability in LLMs, and a new (l, b)-inextractability definition with rank-based bounds provides a tighter measure of extraction risk than prior proxies.

How people use Copilot for Health

cs.HC · 2026-03-09 · accept · novelty 7.0

Large-scale study of Copilot health queries finds substantial personal and caregiving intent, with time-of-day and device variations plus heavy focus on navigating existing healthcare systems.

Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark

cs.AI · 2024-10-06 · unverdicted · novelty 7.0

PolyMATH is a new 5,000-image benchmark where top MLLMs reach at most 41% accuracy on multi-modal mathematical reasoning, with ablation showing minimal gain from text over images.

Cohort-Anchored Foundation Models for Electronic Health Records: From Risk Scores to Auditable Peer Cohorts

cs.LG · 2026-06-20 · unverdicted · novelty 6.0

CAFM is a four-stage framework that anchors EHR foundation models to patient cohorts via deviation-aware curation, cohort-conditioned pretraining, multimodal alignment, and clinician refinement to improve interpretability and trustworthiness.

Automated reproducibility assessments in the social and behavioral sciences using large language models

cs.AI · 2026-06-11 · conditional · novelty 6.0

LLMs match original qualitative conclusions in 80% of 180 studies and effect sizes in 24%, performing similarly to humans in a tested subset, positioning them as a screening tool rather than a full replacement.

Measuring Epistemic Resilience of LLMs Under Misleading Medical Context

cs.CL · 2026-06-10 · unverdicted · novelty 6.0

LLMs drop from 71.1% to 38.0% accuracy on medical questions when misleading context is injected, measured via new MedMisBench benchmark with 10,932 items.

Can I Take Another Dose? Evaluating LLM Decision-Making Under Temporal Uncertainty in OTC Dosing QA

cs.CL · 2026-06-02 · unverdicted · novelty 6.0

Introduces DOSEBENCH benchmark and shows four LLMs often fail at rolling 24-hour dose calculations and constraint adherence in OTC dosing decisions despite appearing confident.

When Retrieval Doesn't Help: A Large-Scale Study of Biomedical RAG

cs.CL · 2026-06-02 · accept · novelty 6.0

Large-scale evaluation shows retrieval-augmented generation yields only marginal and inconsistent gains (1-2 points) over no-retrieval baselines in biomedical QA, with model choice dominating retriever or corpus effects.

DrugClaw and DrugAudit: A Primary-Source-Grounded Agent and Authority-Aware Benchmark for Drug-Information Question Answering

cs.CL · 2026-05-31 · unverdicted · novelty 6.0

DrugClaw tops benchmarks on primary-source grounding and faithfulness for drug-information QA while DrugAudit provides an authority-aware evaluation set of 3,772 items.

Implicit Geographic Inference in LLM Medical Triage: Language-Driven Disparities in Emergency Recommendations

cs.CL · 2026-05-31 · unverdicted · novelty 6.0

LLMs infer patient location from prompt language, causing ER recommendation rates for identical neurological symptoms to vary from 0% to 30% across languages.

Structured Visual Evidence Decomposition for Evidence-Grounded Multimodal Screening of Obstructive Sleep Apnea-Hypopnea Syndrome

cs.CV · 2026-05-23 · unverdicted · novelty 6.0

EviOSAHS decomposes facial images into seven anatomical evidence cards plus clinical data for LLM-based binary OSAHS screening, reporting 88.47% accuracy and 94.86% sensitivity on 642 subjects while outperforming direct prompting baselines.

CuraView: A Multi-Agent Framework for Medical Hallucination Detection with GraphRAG-Enhanced Knowledge Verification

cs.CL · 2026-05-05 · unverdicted · novelty 6.0

CuraView detects sentence-level faithfulness hallucinations in medical discharge summaries via GraphRAG knowledge graphs and multi-agent evidence grading, achieving 0.831 F1 on critical contradictions with a fine-tuned Qwen3-14B model and 50% relative improvement over baselines.

MedSkillAudit: A Domain-Specific Audit Framework for Medical Research Agent Skills

cs.AI · 2026-04-22 · unverdicted · novelty 6.0

MedSkillAudit is a new domain-specific audit framework for medical research agent skills that achieved moderate agreement with expert reviews (ICC 0.449), exceeding the human inter-rater baseline (ICC 0.300).

The Provenance Gap in Clinical AI: Evidence-Traceable Temporal Knowledge Graphs for Rare Disease Reasoning

cs.CL · 2026-04-18 · unverdicted · novelty 6.0

HEG-TKG grounds LLM clinical reasoning in hierarchical evidence-based temporal knowledge graphs from 4,512 PubMed records, delivering 100% citation verifiability and error detectability where standard RAG and unprompted LLMs produce none.

MedDialBench: Benchmarking LLM Diagnostic Robustness under Parametric Adversarial Patient Behaviors

cs.CL · 2026-04-08 · unverdicted · novelty 6.0

MedDialBench shows LLMs suffer 1.7-3.4x larger diagnostic accuracy drops from patients fabricating symptoms than withholding them, with fabrication driving super-additive interaction effects across models.

citing papers explorer

Showing 2 of 2 citing papers after filters.

CuraView: A Multi-Agent Framework for Medical Hallucination Detection with GraphRAG-Enhanced Knowledge Verification

cs.CL · 2026-05-05 · unverdicted · none · ref 5 · internal anchor

CuraView detects sentence-level faithfulness hallucinations in medical discharge summaries via GraphRAG knowledge graphs and multi-agent evidence grading, achieving 0.831 F1 on critical contradictions with a fine-tuned Qwen3-14B model and 50% relative improvement over baselines.

Medical Reasoning with Large Language Models: A Survey and MR-Bench

cs.CL · 2026-03-17 · accept · none · ref 2 · internal anchor

LLMs show strong exam performance on medical tasks but exhibit a clear gap in accuracy on authentic clinical decision-making as measured by the new MR-Bench benchmark and unified evaluations.
