NeuroQA is a large-scale 3D brain MRI visual question answering benchmark with verified image-grounded QA pairs, multi-domain coverage, and baseline evaluations showing current models lag behind text-only performance.
hub Baseline reference
PathVQA: 30000+ Questions for Medical Visual Question Answering
Baseline reference. 55% of citing Pith papers use this work as a benchmark or comparison.
abstract
Is it possible to develop an "AI Pathologist" to pass the board-certified examination of the American Board of Pathology? To achieve this goal, the first step is to create a visual question answering (VQA) dataset where the AI agent is presented with a pathology image together with a question and is asked to give the correct answer. Our work makes the first attempt to build such a dataset. Different from creating general-domain VQA datasets where the images are widely accessible and there are many crowdsourcing workers available and capable of generating question-answer pairs, developing a medical VQA dataset is much more challenging. First, due to privacy concerns, pathology images are usually not publicly available. Second, only well-trained pathologists can understand pathology images, but they barely have time to help create datasets for AI research. To address these challenges, we resort to pathology textbooks and online digital libraries. We develop a semi-automated pipeline to extract pathology images and captions from textbooks and generate question-answer pairs from captions using natural language processing. We collect 32,799 open-ended questions from 4,998 pathology images where each question is manually checked to ensure correctness. To our best knowledge, this is the first dataset for pathology VQA. Our dataset will be released publicly to promote research in medical VQA.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
MI-CXR is a new benchmark that shows state-of-the-art vision-language models achieve only 29.3% accuracy on longitudinal reasoning tasks across multi-visit chest X-ray sequences.
DALPHIN benchmark finds the pathology-specific AI copilot PathChat+ shows no statistically significant difference from expert pathologists in 4 of 6 tasks, with general models matching in 1-2 tasks, on a diverse open dataset released for ongoing evaluation.
MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.
MedFlowBench evaluates VLM agents on full radiology and pathology studies by requiring both task answers and verifiable evidence like key slices and regions of interest, revealing that answer-only scores overestimate performance.
Fetal-Gauge benchmark shows state-of-the-art vision-language models reach only 55% accuracy on fetal ultrasound tasks, well below clinical needs and highlighting the requirement for domain-adapted models.
DDX-TRACE is a physician-adjudicated benchmark for evaluating VLMs on evidence-supported diagnostic trajectories rather than final answers alone in multimodal neuroradiology.
BioXArena benchmarks LLM agents on generating end-to-end ML pipelines for 76 multi-modal biomedical tasks, with MLEvolve plus Gemini-3.1-Pro scoring highest at 0.666.
PlantInquiryVQA shows multimodal LLMs describe plant symptoms but struggle with clinical reasoning and diagnosis, with structured Chain of Inquiry improving correctness and reducing hallucinations.
KIRA is a unified architecture for visual RAG that reports 0.97 retrieval precision, 1.0 grounding, and 0.707 domain correctness across medical, circuit, satellite, and histopathology domains via hierarchical chunking, dual-path retrieval, and evidence-conditioned generation.
A new multi-frame VQA benchmark on volumetric MRI demonstrates that bounding-box supervised fine-tuning improves spatial grounding in VLMs over zero-shot baselines.
Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.
MEDSYN benchmark shows MLLMs match experts on differential diagnosis lists but have much larger gaps to final diagnosis selection than humans, due to text overreliance and cross-modal evidence gaps.
IBISAgent enables MLLMs to perform iterative pixel-level visual reasoning for biomedical object referring and segmentation via text-based clicks and agentic RL, outperforming prior SOTA methods without model modifications.
Neural-MedBench reveals sharp performance drops in state-of-the-art VLMs on reasoning-intensive neurology tasks compared to conventional classification benchmarks, with reasoning failures dominating errors.
FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperforming larger models with only 630 vision tokens at 3B scale.
Cambrian-1 is a vision-centric multimodal LLM family that evaluates over 20 vision encoders, introduces CV-Bench and the Spatial Vision Aggregator, and releases open models, code, and data achieving strong performance on visual grounding tasks.
PathNavigate introduces a scan-search-readout routine with surprise-guided low-mag scanning and shared slide memory to improve training-free WSI-VQA accuracy and efficiency.
Existing visual attribution methods often fail to identify the visual evidence used by LVLMs in chest X-ray reasoning, while MedFocus using unbalanced optimal transport and targeted interventions substantially outperforms them across multiple models and settings.
Introduces Wasserstein equilibrium decoding that improves accuracy and convergence speed for small VLMs on medical VQA benchmarks by using semantic consensus instead of lexical order.
Introduces BanglaMedVQA dataset of clinically validated image-question-answer pairs and benchmarks foundation models, finding substantially lower performance than on English MedVQA especially on diagnostic questions.
VISTA uses prefix resampling and a vision-aware attention score to address data imbalance and language prior bias in self-improvement training of MLLMs, yielding up to 13.66% gains on reasoning tasks.
Self-verification in medical VQA creates a verification mirage where verifiers exhibit high error and agreement bias on wrong answers, with reliability strongly conditioned on task type.
RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards for cancer screening.
citing papers explorer
-
MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models
MEDSYN benchmark shows MLLMs match experts on differential diagnosis lists but have much larger gaps to final diagnosis selection than humans, due to text overreliance and cross-modal evidence gaps.
-
How Good LLMs Are at Answering Bangla Medical Visual Questions? Dataset and Benchmarking
Introduces BanglaMedVQA dataset of clinically validated image-question-answer pairs and benchmarks foundation models, finding substantially lower performance than on English MedVQA especially on diagnostic questions.
-
Dialectic-Med: Mitigating Diagnostic Hallucinations via Counterfactual Adversarial Multi-Agent Debate
Dialectic-Med uses proponent-opponent-mediator agents with visual falsification to enforce grounded diagnostic reasoning in MLLMs, achieving SOTA accuracy and reduced hallucinations on MIMIC-CXR-VQA, VQA-RAD, and PathVQA.
-
Learning from Medical Entity Trees: An Entity-Centric Medical Data Engineering Framework for MLLMs
A Medical Entity Tree organizes medical knowledge to engineer higher-quality training data that boosts general MLLMs on medical benchmarks.
-
MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs
MedXIAOHE is a medical MLLM that claims state-of-the-art benchmark performance through specialized pretraining to cover long-tail diseases and RL-based reasoning training.