hub

PubMedQA: A Dataset for Biomedical Research Question Answering

Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W. Cohen, Xinghua Lu · 2019 · cs.CL · arXiv 1909.06146

31 Pith papers cite this work. Polarity classification is still indexing.

31 Pith papers citing it

open full Pith review browse 31 citing papers arXiv PDF

abstract

We introduce PubMedQA, a novel biomedical question answering (QA) dataset collected from PubMed abstracts. The task of PubMedQA is to answer research questions with yes/no/maybe (e.g.: Do preoperative statins reduce atrial fibrillation after coronary artery bypass grafting?) using the corresponding abstracts. PubMedQA has 1k expert-annotated, 61.2k unlabeled and 211.3k artificially generated QA instances. Each PubMedQA instance is composed of (1) a question which is either an existing research article title or derived from one, (2) a context which is the corresponding abstract without its conclusion, (3) a long answer, which is the conclusion of the abstract and, presumably, answers the research question, and (4) a yes/no/maybe answer which summarizes the conclusion. PubMedQA is the first QA dataset where reasoning over biomedical research texts, especially their quantitative contents, is required to answer the questions. Our best performing model, multi-phase fine-tuning of BioBERT with long answer bag-of-word statistics as additional supervision, achieves 68.1% accuracy, compared to single human performance of 78.0% accuracy and majority-baseline of 55.2% accuracy, leaving much room for improvement. PubMedQA is publicly available at https://pubmedqa.github.io.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 2 dataset 1

citation-polarity summary

background 2 use dataset 1

representative citing papers

AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

cs.HC · 2024-05-13 · conditional · novelty 8.0

AgentClinic is a multimodal agent benchmark demonstrating that LLM diagnostic accuracy on MedQA drops to below one-tenth in sequential clinical simulations, with Claude-3.5 leading and large tool-use differences across models.

Small LLMs for Biomedical Claim Verification: Cost-Effective Fine-Tuning, Structural Dataset Shortcuts, and Cross-Domain Generalization

cs.CL · 2026-06-11 · unverdicted · novelty 7.0

Fine-tuned Mistral-7B via QLoRA achieves up to 12% higher F1 than GPT-4o on biomedical claim verification with 1008 examples, identifies a structural shortcut in SciFact, and shows robust cross-domain transfer from sound data.

ClinicalMC: A Benchmark for Multi-Course Clinical Decision-Making with Large Language Models

cs.AI · 2026-06-02 · unverdicted · novelty 7.0

ClinicalMC is a benchmark of 1,275 Chinese and 5,804 English multi-course clinical samples across four stages, evaluated via a multi-agent framework on closed-source, open-source, and medical LLMs in static and dynamic settings.

IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation

cs.CV · 2026-01-06 · conditional · novelty 7.0

IBISAgent enables MLLMs to perform iterative pixel-level visual reasoning for biomedical object referring and segmentation via text-based clicks and agentic RL, outperforming prior SOTA methods without model modifications.

M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

cs.CL · 2024-02-05 · unverdicted · novelty 7.0

M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual, and long-document retrieval benchmarks.

Capabilities of GPT-4 on Medical Challenge Problems

cs.CL · 2023-03-20 · unverdicted · novelty 7.0

GPT-4 exceeds the USMLE passing score by more than 20 points and outperforms both GPT-3.5 and the medically fine-tuned Med-PaLM on the MultiMedQA benchmarks.

The Score Granularity Gap in Black-Box LLM Classification: A Comparative Study of Confidence Constructions

cs.CL · 2026-06-20 · unverdicted · novelty 6.0

Comparative evaluation of seven confidence constructions across 25 LLM-dataset pairs reveals that verbalized scores provide good ranking but coarse granularity for thresholding, while multi-query aggregation helps weak models but can harm strong ones.

OPD-Evolver: Cultivating Holistic Agent Evolver via On-Policy Distillation

cs.CL · 2026-06-16 · unverdicted · novelty 6.0

OPD-Evolver uses on-policy self-distillation in fast interaction and slow attribution loops to build agents with holistic memory competence, outperforming prior systems by up to 11.5% and allowing a 9B model to compete with much larger ones.

SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment

cs.AI · 2026-06-01 · unverdicted · novelty 6.0

SafeSteer restricts reverse KL penalty to safety tokens selected via activation steering, achieving strong safety on seven benchmarks with minimal degradation on five capability benchmarks using only 100 harmful samples and no general data.

Single-Rollout Hidden-State Dynamics for Training-Free RLVR Data Selection

cs.LG · 2026-05-27 · unverdicted · novelty 6.0

SHIFT selects compact RLVR training subsets using the magnitude of hidden-state change from a single inference rollout plus quality-weighted farthest-first coverage, outperforming training-free baselines on math reasoning and medical QA under low budgets.

CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?

cs.CL · 2026-05-15 · unverdicted · novelty 6.0

CHI-Bench shows current AI agents achieve at most 28% success on long-horizon healthcare workflows that require dense policy adherence, multi-role handoffs, and multi-turn interactions.

PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts

cs.CR · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

PragLocker generates function-preserving but non-portable prompts for LLM agents via code-symbol semantic anchoring followed by target-model feedback noise injection.

Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints

cs.AI · 2026-04-14 · unverdicted · novelty 6.0

Coupled constraints on weight updates in a safety subspace and regularization of SAE-identified safety features preserve LLM refusal behaviors during fine-tuning better than weight-only or activation-only methods.

MedDialBench: Benchmarking LLM Diagnostic Robustness under Parametric Adversarial Patient Behaviors

cs.CL · 2026-04-08 · unverdicted · novelty 6.0

MedDialBench shows LLMs suffer 1.7-3.4x larger diagnostic accuracy drops from patients fabricating symptoms than withholding them, with fabrication driving super-additive interaction effects across models.

LABBench2: An Improved Benchmark for AI Systems Performing Biology Research

cs.AI · 2026-02-04 · unverdicted · novelty 6.0

LABBench2 is a more challenging benchmark than LAB-Bench for assessing AI performance on biology research tasks, with frontier models showing accuracy drops of 26-46% across subtasks.

Wisdom is Knowing What not to Say: Hallucination-Free LLMs Unlearning via Attention Shifting

cs.CL · 2025-10-20 · unverdicted · novelty 6.0

Attention-Shifting uses importance-aware suppression on unlearning data and retention enhancement on retained data via dual-loss optimization to achieve selective unlearning with better utility preservation than prior methods.

DUET: Optimizing Training Data Mixtures via Feedback from Unseen Evaluation Tasks

cs.LG · 2025-02-01 · unverdicted · novelty 6.0

DUET is a global-to-local method that optimizes LLM training data mixtures via Bayesian optimization guided by influence-based selection and feedback from unseen evaluation tasks, with a regret bound showing convergence to the optimal mixture.

HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs

cs.CL · 2024-12-25 · unverdicted · novelty 6.0

HuatuoGPT-o1 achieves superior medical complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.

An Empirical Study of Mamba-based Language Models

cs.LG · 2024-06-12 · accept · novelty 6.0

An 8B Mamba-2-Hybrid with 43% Mamba-2, 7% attention, and 50% MLP layers exceeds an 8B Transformer by 2.65 points on average across 12 tasks and matches it on 23 long-context tasks while enabling up to 8x faster inference.

Clinically Structured Rank-Gated LoRA for Cross-Benchmark Medical Question Answering

cs.CL · 2026-06-30 · unverdicted · novelty 5.0 · 2 refs

BiRG-LoRA achieves 69.31% macro-average accuracy across CMB, CMExam, MedQA, and MedMCQA, outperforming MoELoRA by 0.89 points with 28.1% fewer trainable parameters under a matched Qwen3-8B protocol.

Active Evidence-Seeking and Diagnostic Reasoning in Large Language Models for Clinical Decision Support

cs.AI · 2026-05-21 · unverdicted · novelty 5.0

Multi-turn evidence seeking reduces LLM diagnostic accuracy by 12.75% and supporting-evidence quality by 24.36% versus full-context evaluation in a new OSCE-inspired benchmark across 468 cases and 15 models.

Claim-Selective Certification for High-Risk Medical Retrieval-Augmented Generation

cs.CL · 2026-05-21 · unverdicted · novelty 5.0

Claim-selective certification decomposes medical RAG responses into verifiable claims scored against retrieved evidence and mapped via an intent-aware selector to actions, reporting zero UCCR and action accuracy of 0.92 on dev and 0.90 on test.

VerifAI: A Verifiable Open-Source Search Engine for Biomedical Question Answering

cs.IR · 2026-01-16 · unverdicted · novelty 5.0

VerifAI is an open-source biomedical QA system that decomposes generated answers into claims and verifies them with a fine-tuned NLI engine to reduce hallucinations and provide traceable citations.

CoreGuard: Safeguarding Foundational Capabilities of LLMs Against Model Stealing in Edge Deployment

cs.CR · 2024-10-16 · unverdicted · novelty 5.0

CoreGuard introduces a computation- and communication-efficient protocol claimed to deliver upper-bound security against model stealing for edge-deployed LLMs with negligible overhead.

citing papers explorer

Showing 6 of 6 citing papers after filters.

ClinicalMC: A Benchmark for Multi-Course Clinical Decision-Making with Large Language Models cs.AI · 2026-06-02 · unverdicted · none · ref 30 · internal anchor
ClinicalMC is a benchmark of 1,275 Chinese and 5,804 English multi-course clinical samples across four stages, evaluated via a multi-agent framework on closed-source, open-source, and medical LLMs in static and dynamic settings.
SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment cs.AI · 2026-06-01 · unverdicted · none · ref 85 · internal anchor
SafeSteer restricts reverse KL penalty to safety tokens selected via activation steering, achieving strong safety on seven benchmarks with minimal degradation on five capability benchmarks using only 100 harmful samples and no general data.
Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints cs.AI · 2026-04-14 · unverdicted · none · ref 21 · internal anchor
Coupled constraints on weight updates in a safety subspace and regularization of SAE-identified safety features preserve LLM refusal behaviors during fine-tuning better than weight-only or activation-only methods.
LABBench2: An Improved Benchmark for AI Systems Performing Biology Research cs.AI · 2026-02-04 · unverdicted · none · ref 11 · internal anchor
LABBench2 is a more challenging benchmark than LAB-Bench for assessing AI performance on biology research tasks, with frontier models showing accuracy drops of 26-46% across subtasks.
Active Evidence-Seeking and Diagnostic Reasoning in Large Language Models for Clinical Decision Support cs.AI · 2026-05-21 · unverdicted · none · ref 5 · internal anchor
Multi-turn evidence seeking reduces LLM diagnostic accuracy by 12.75% and supporting-evidence quality by 24.36% versus full-context evaluation in a new OSCE-inspired benchmark across 468 cases and 15 models.
MedGemma 1.5 Technical Report cs.AI · 2026-04-06 · unverdicted · none · ref 10 · internal anchor
MedGemma 1.5 4B reports absolute gains of 11% on 3D MRI classification, 3% on 3D CT, 47% macro F1 on pathology slides, 35% IoU on anatomical localization, and 5-22% on clinical QA tasks over MedGemma 1.

PubMedQA: A Dataset for Biomedical Research Question Answering

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer