PubMedQA: A Dataset for Biomedical Research Question Answering

Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, Xinghua Lu · 2019 · DOI 10.18653/v1/d19-1259

35 Pith papers cite this work. Polarity classification is still indexing.


citation-role summary

background: 2 · dataset: 1

citation-polarity summary

unclear: 2 · use dataset: 1

representative citing papers

MedHal-Loc: Are "Explainable-by-Architecture" Medical Hallucination Detectors Faithful Localizers? A Localization Benchmark

cs.CL · 2026-06-19 · unverdicted · novelty 7.0

MedHal-Loc benchmark shows KG-triple hallucination detectors localize errors no better than chance on controlled medical statements due to entity extraction limits, while NLI and consistency methods succeed above chance, and real hallucinations are mostly diffuse conclusion changes.

Authority, Truth, and Citation Bias: A Large-Scale Multi-Domain Benchmark for Studying Epistemic Susceptibility in Large Language Models

cs.LG · 2026-06-11 · unverdicted · novelty 7.0

AuthorityBench shows citation presence (real or fabricated) increases LLM hallucination rates vs no-citation baseline, strongest for fabricated citations on true claims, with domain variation but negligible venue or author effects.

Can AI Agents Synthesize Scientific Conclusions?

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

A new benchmark and clean-room harness show frontier AI agents reach only 0.337 factual F1 when synthesizing conclusions from scientific evidence.

AutoMedBench: Towards Medical AutoResearch with Agentic AI Models

cs.AI · 2026-06-01 · conditional · novelty 7.0

AutoMedBench evaluates AI agents on long-horizon medical workflows across five stages and, based on thousands of runs, finds validation and submission to be the dominant failure points.

JMed48k: A Multi-Profession Japanese Medical Licensing Benchmark for Vision-Language Model Evaluation

cs.CV · 2026-05-21 · unverdicted · novelty 7.0 · 2 refs

JMed48k is a new benchmark of Japanese healthcare licensing exams used to evaluate 21 VLMs, with a paired image-removal audit revealing large differences in how models and professions benefit from visual content.

Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation

cs.CL · 2026-05-12 · conditional · novelty 7.0 · 2 refs

Checkup2Action is a new multimodal dataset and benchmark for generating safe, prioritized action cards from real-world clinical check-up reports using large language models.

BioGraphletQA: Knowledge-Anchored Generation of Complex QA Datasets

cs.CL · 2026-04-28 · conditional · novelty 7.0

A graphlet-anchored framework generates 119,856 factually grounded biomedical QA pairs that improve accuracy on PubMedQA and MedQA benchmarks.

PaperMind: Benchmarking Agentic Reasoning and Critique over Scientific Papers in Multimodal LLMs

cs.IR · 2026-04-23 · unverdicted · novelty 7.0

PaperMind is a new benchmark that evaluates integrated multimodal reasoning and critique over scientific papers through four complementary task families across seven domains.

M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

cs.CL · 2024-02-05 · unverdicted · novelty 7.0

M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual, and long-document retrieval benchmarks.

ELDR: Expert-Locality-Aware Decode Routing for PD-Disaggregated MoE Serving

cs.DC · 2026-07-01 · unverdicted · novelty 6.0

Relative to load-balancing baselines, ELDR reduces median TPOT by 5.9-13.9% in PD-disaggregated MoE serving by routing decode requests via prefill-derived expert signatures and K-means locality partitioning.

BioHarness: Substrate-Aware Evidence Assembly for Biomedical Question Answering across Literature, Knowledge Bases, and Biological Atlases

q-bio.QM · 2026-06-17 · unverdicted · novelty 6.0

BioHarness improves pooled biomedical QA score from 65.9 to 71.0 on 19,302 items by using staged, substrate-aware evidence assembly that escalates only when needed.

DrugBench: Evaluating AI Control Protocols for Medication Harm Mitigation

cs.AI · 2026-06-10 · unverdicted · novelty 6.0

DrugBench evaluates AI control protocols on 3,671 medical conversations for four medication harm types and finds existing protocols subvertible, proposing severity-based monitoring instead.

TaDA: Calibrated Probe Gating for Task-Domain LoRA Merging

cs.CL · 2026-06-03 · unverdicted · novelty 6.0

TaDA merges task-domain LoRAs via calibrated per-layer gating and subspace-aware merging, reaching 0.452 avg accuracy on six scientific QA benchmarks and 85.9% on six image classification benchmarks.

Search-Time Contamination in Deep Research Agents: Measuring Performance Inflation in Public Benchmark Evaluation

cs.CR · 2026-06-03 · unverdicted · novelty 6.0

Deep research agents exhibit widespread search-time contamination on six public benchmarks, with three defined leakage types inflating performance by up to 4%.

ANN Search: Recall What Matters

cs.IR · 2026-06-03 · conditional · novelty 6.0

ANN search quality is better assessed by 1/Ratio@k than Recall@k because the former tracks downstream task utility more closely while allowing substantially lower computational cost.

When Retrieval Doesn't Help: A Large-Scale Study of Biomedical RAG

cs.CL · 2026-06-02 · accept · novelty 6.0

Large-scale evaluation shows retrieval-augmented generation yields only marginal and inconsistent gains (1-2 points) over no-retrieval baselines in biomedical QA, with model choice dominating retriever or corpus effects.

Graph Alignment Topology as an Inductive Bias for Grounding Detection

cs.CL · 2026-05-21 · unverdicted · novelty 6.0

A GNN trained on bipartite alignment graphs between references and LLM generations reports state-of-the-art hallucination detection across four datasets, beating prior methods and GPT-4o.

ChronoMedKG: A Temporally-Grounded Biomedical Knowledge Graph and Benchmark for Clinical Reasoning

cs.CL · 2026-05-21 · unverdicted · novelty 6.0

ChronoMedKG builds a temporal biomedical KG with 460k evidence-linked triples across 13k diseases using LLM consensus and introduces the ChronoTQA benchmark showing RAG gains on time-sensitive questions.

Compared to What? Baselines and Metrics for Counterfactual Prompting

cs.CL · 2026-05-01 · conditional · novelty 6.0

Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.

Building evidence-based knowledge bases from full-text literature for disease-specific biomedical reasoning

cs.CE · 2026-03-30 · unverdicted · novelty 6.0

EvidenceNet releases disease-specific biomedical knowledge bases, with 7,872 evidence records for HCC and 6,622 for CRC, plus accompanying graphs, all extracted via an LLM pipeline with reported high fidelity.

Token-Level Density-Based Uncertainty Quantification Methods for Eliciting Truthfulness of Large Language Models

cs.CL · 2025-02-20 · unverdicted · novelty 6.0

Adapts multi-layer token-level Mahalanobis distance with supervised linear regression to yield improved uncertainty scores for LLM truthfulness tasks.

Unconditional Truthfulness: Learning Unconditional Uncertainty of Large Language Models

cs.CL · 2024-08-20 · unverdicted · novelty 6.0

A regression model using attention features and recurrent uncertainty scores improves selective generation in LLMs over unsupervised and supervised baselines on ten datasets and three models.

DataComp-LM: In search of the next generation of training sets for language models

cs.LG · 2024-06-17 · unverdicted · novelty 6.0

DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.

Can AI-Generated Text be Reliably Detected?

cs.CL · 2023-03-17 · unverdicted · novelty 6.0

Recursive paraphrasing attacks substantially lower detection rates for multiple AI text detectors with only minor quality loss, while a theoretical analysis ties best-case AUROC to total variation distance between human and AI distributions.

citing papers explorer

Showing 2 of 2 citing papers after filters.

JMed48k: A Multi-Profession Japanese Medical Licensing Benchmark for Vision-Language Model Evaluation

cs.CV · 2026-05-21 · unverdicted · none · ref 22 · 2 links

JMed48k is a new benchmark of Japanese healthcare licensing exams used to evaluate 21 VLMs, with a paired image-removal audit revealing large differences in how models and professions benefit from visual content.

UniReason-Med: A Shared Grounded Reasoning Interface for 2D-to-3D Transfer in Medical VQA

cs.CV · 2026-06-10 · unverdicted · none · ref 192

UniReason-Med introduces a unified framework for 2D and 3D medical VQA with shared grounded reasoning, trained on a 220K dataset, claiming that joint 2D+3D supervision improves 3D performance over 3D-only training.
