hub

What disease does this patient have? a large-scale open domain question answering dataset from medical exams

Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, Peter Szolovits · 2020 · arXiv 2009.13081

13 Pith papers cite this work. Polarity classification is still indexing.

13 Pith papers citing it

read on arXiv browse 13 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

cs.CL · 2026-05-13 · accept · novelty 8.0

CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.

Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation

cs.CL · 2026-05-12 · conditional · novelty 7.0 · 2 refs

Checkup2Action is a new multimodal dataset and benchmark for generating safe, prioritized action cards from real-world clinical check-up reports using large language models.

BioGraphletQA: Knowledge-Anchored Generation of Complex QA Datasets

cs.CL · 2026-04-28 · conditional · novelty 7.0

A graphlet-anchored framework generates 119,856 factually grounded biomedical QA pairs that improve accuracy on PubMedQA and MedQA benchmarks.

SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models

cs.LG · 2026-04-20 · unverdicted · novelty 7.0

SafeAnchor preserves 93.2% of original safety alignment across sequential domain adaptations by anchoring low-rank safety subspaces and constraining orthogonal updates, while matching unconstrained fine-tuning performance within 1.5 points.

Crowded in B-Space: Calibrating Shared Directions for LoRA Merging

cs.CL · 2026-04-18 · unverdicted · novelty 7.0

Pico reduces LoRA merge interference by calibrating over-shared directions in the B matrix before merging, yielding 3.4-8.3 point accuracy gains and sometimes beating joint training.

Query-efficient model evaluation using cached responses

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

DKPS-based methods leverage cached model responses to achieve equivalent benchmark prediction accuracy with substantially fewer queries than standard evaluation.

MedExAgent: Training LLM Agents to Ask, Examine, and Diagnose in Noisy Clinical Environments

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

MedExAgent models clinical diagnosis as a POMDP with patient and exam noise, then uses supervised fine-tuning followed by DAPO optimization to train an agent that matches larger models on diagnostic accuracy while controlling exam costs.

Compared to What? Baselines and Metrics for Counterfactual Prompting

cs.CL · 2026-05-01 · conditional · novelty 6.0

Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.

Exclusive Unlearning

cs.CL · 2026-04-07 · unverdicted · novelty 6.0

Exclusive Unlearning makes LLMs safe by forgetting all but retained domain knowledge, protecting against jailbreaks while preserving useful responses in areas like medicine and math.

MedFabric and EtHER: A Data-Centric Framework for Word-Level Fabrication Generation and Detection in Medical LLMs

cs.CL · 2026-05-05 · unverdicted · novelty 5.0

MedFabric dataset and EtHER detector achieve over 15% better word-level fabrication detection in medical LLMs than prior methods by generating stylistically faithful errors and using decomposition-based checking.

From Exposure to Internalization: Dual-Stream Calibration for In-context Clinical Reasoning

q-bio.QM · 2026-04-07 · unverdicted · novelty 5.0

Dual-Stream Calibration uses entropy minimization and iterative meta-learning at test time to internalize clinical evidence and outperform standard in-context learning baselines on medical tasks.

Medical Incident Causal Factors and Preventive Measures Generation Using Tag-based Example Selection in Few-shot Learning

cs.CL · 2026-05-11 · unverdicted · novelty 4.0

Tag-based few-shot selection yields higher precision and stability than random or similarity-based methods when using LLMs to analyze medical incidents.

A Systematic Study of Retrieval Pipeline Design for Retrieval-Augmented Medical Question Answering

cs.CL · 2026-04-08 · unverdicted · novelty 4.0

Dense retrieval plus query reformulation and reranking reaches 60.49% accuracy on MedQA USMLE, outperforming other setups while domain-specialized models make better use of the retrieved evidence.

citing papers explorer

Showing 13 of 13 citing papers.

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence cs.CL · 2026-05-13 · accept · none · ref 13
CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.
Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation cs.CL · 2026-05-12 · conditional · none · ref 9 · 2 links
Checkup2Action is a new multimodal dataset and benchmark for generating safe, prioritized action cards from real-world clinical check-up reports using large language models.
BioGraphletQA: Knowledge-Anchored Generation of Complex QA Datasets cs.CL · 2026-04-28 · conditional · none · ref 14
A graphlet-anchored framework generates 119,856 factually grounded biomedical QA pairs that improve accuracy on PubMedQA and MedQA benchmarks.
SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models cs.LG · 2026-04-20 · unverdicted · none · ref 26
SafeAnchor preserves 93.2% of original safety alignment across sequential domain adaptations by anchoring low-rank safety subspaces and constraining orthogonal updates, while matching unconstrained fine-tuning performance within 1.5 points.
Crowded in B-Space: Calibrating Shared Directions for LoRA Merging cs.CL · 2026-04-18 · unverdicted · none · ref 14
Pico reduces LoRA merge interference by calibrating over-shared directions in the B matrix before merging, yielding 3.4-8.3 point accuracy gains and sometimes beating joint training.
Query-efficient model evaluation using cached responses cs.LG · 2026-05-08 · unverdicted · none · ref 165
DKPS-based methods leverage cached model responses to achieve equivalent benchmark prediction accuracy with substantially fewer queries than standard evaluation.
MedExAgent: Training LLM Agents to Ask, Examine, and Diagnose in Noisy Clinical Environments cs.CL · 2026-05-08 · unverdicted · none · ref 17
MedExAgent models clinical diagnosis as a POMDP with patient and exam noise, then uses supervised fine-tuning followed by DAPO optimization to train an agent that matches larger models on diagnostic accuracy while controlling exam costs.
Compared to What? Baselines and Metrics for Counterfactual Prompting cs.CL · 2026-05-01 · conditional · none · ref 3
Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.
Exclusive Unlearning cs.CL · 2026-04-07 · unverdicted · none · ref 5
Exclusive Unlearning makes LLMs safe by forgetting all but retained domain knowledge, protecting against jailbreaks while preserving useful responses in areas like medicine and math.
MedFabric and EtHER: A Data-Centric Framework for Word-Level Fabrication Generation and Detection in Medical LLMs cs.CL · 2026-05-05 · unverdicted · none · ref 11
MedFabric dataset and EtHER detector achieve over 15% better word-level fabrication detection in medical LLMs than prior methods by generating stylistically faithful errors and using decomposition-based checking.
From Exposure to Internalization: Dual-Stream Calibration for In-context Clinical Reasoning q-bio.QM · 2026-04-07 · unverdicted · none · ref 48
Dual-Stream Calibration uses entropy minimization and iterative meta-learning at test time to internalize clinical evidence and outperform standard in-context learning baselines on medical tasks.
Medical Incident Causal Factors and Preventive Measures Generation Using Tag-based Example Selection in Few-shot Learning cs.CL · 2026-05-11 · unverdicted · none · ref 26
Tag-based few-shot selection yields higher precision and stability than random or similarity-based methods when using LLMs to analyze medical incidents.
A Systematic Study of Retrieval Pipeline Design for Retrieval-Augmented Medical Question Answering cs.CL · 2026-04-08 · unverdicted · none · ref 7
Dense retrieval plus query reformulation and reranking reaches 60.49% accuracy on MedQA USMLE, outperforming other setups while domain-specialized models make better use of the retrieved evidence.

What disease does this patient have? a large-scale open domain question answering dataset from medical exams

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer