RealICU is a new benchmark using physician hindsight labels on MIMIC-IV ICU data that exposes LLM failures in long-horizon clinical assessment, acute problem detection, action recommendation, and red-flag identification.
Agentclinic: A multimodal agent benchmark to evaluate ai in simulated clinical environments
9 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
DeepTumorVQA is a new stage-wise 3D CT VQA benchmark showing that quantitative measurement is the main failure point for current medical VLMs and that tool augmentation substantially improves later reasoning stages.
PhysicianBench is a new benchmark of 100 physician-reviewed, execution-grounded tasks in live EHR environments where the best LLM agent reaches only 46% success and open-source models reach 19%.
Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.
BioMedArena releases a standardized toolkit with 147 biomedical benchmarks, 75 tools, and six harnesses that achieve SOTA results on eight tasks with a +15.03 percentage point average lift.
EndoGov uses specialist agents plus a governance layer with hard and soft rule paths to deliver guideline-compliant endometrial cancer risk stratification, reporting 0.943 accuracy and 0.93% logic-violation rate on TCGA-UCEC while outperforming neural baselines on CPTAC-UCEC.
MedDialBench shows LLMs suffer 1.7-3.4x larger diagnostic accuracy drops from patients fabricating symptoms than withholding them, with fabrication driving super-additive interaction effects across models.
Evo-MedAgent adds three evolving memory stores to LLM agents for chest X-ray diagnosis, raising MCQ accuracy from 0.68 to 0.79 on GPT-5-mini and 0.76 to 0.87 on Gemini-3 Flash without any training.
citing papers explorer
-
PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments
PhysicianBench is a new benchmark of 100 physician-reviewed, execution-grounded tasks in live EHR environments where the best LLM agent reaches only 46% success and open-source models reach 19%.
-
Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models
Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.
-
BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents
BioMedArena releases a standardized toolkit with 147 biomedical benchmarks, 75 tools, and six harnesses that achieve SOTA results on eight tasks with a +15.03 percentage point average lift.