RealICU is a new benchmark using physician hindsight labels on MIMIC-IV ICU data that exposes LLM failures in long-horizon clinical assessment, acute problem detection, action recommendation, and red-flag identification.
AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments
Canonical reference. 88% of citing Pith papers cite this work as background.
abstract
Evaluating large language models (LLMs) in clinical scenarios is crucial to assessing their potential clinical utility. Existing benchmarks rely heavily on static question-answering, which does not accurately depict the complex, sequential nature of clinical decision-making. Here, we introduce AgentClinic, a multimodal agent benchmark for evaluating LLMs in simulated clinical environments that include patient interactions, multimodal data collection under incomplete information, and the usage of various tools, resulting in an in-depth evaluation across nine medical specialties and seven languages. We find that solving MedQA problems in the sequential decision-making format of AgentClinic is considerably more challenging, resulting in diagnostic accuracies that can drop to below a tenth of the original accuracy. Overall, we observe that agents sourced from Claude-3.5 outperform other LLM backbones in most settings. Nevertheless, we see stark differences in the LLMs' ability to make use of tools, such as experiential learning, adaptive retrieval, and reflection cycles. Strikingly, Llama-3 shows up to 92% relative improvements with the notebook tool that allows for writing and editing notes that persist across cases. To further scrutinize our clinical simulations, we leverage real-world electronic health records, perform a clinical reader study, perturb agents with biases, and explore novel patient-centric metrics that this interactive environment enables for the first time.
representative citing papers
DeepTumorVQA is a new stage-wise 3D CT VQA benchmark showing that quantitative measurement is the main failure point for current medical VLMs and that tool augmentation substantially improves later reasoning stages.
PhysicianBench is a new benchmark of 100 physician-reviewed, execution-grounded tasks in live EHR environments where the best LLM agent reaches only 46% success and open-source models reach 19%.
Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.
SkeMex distills agent trajectories into value-aware skills organized in general/task/action branches and evolves them via a closed-loop Read-Write-Assess-Govern process, outperforming prior memory agents on clinical tasks.
The MedSP1000 benchmark shows that top LLMs complete at most 60.4% of expert rubric items during multi-turn standardized patient simulations.
AutoMedBench evaluates AI agents on long-horizon medical workflows across five stages and, based on thousands of runs, finds that validation and submission are the dominant failure points.
A survey that unifies prior work on multi-agent LLM systems via the LIFE framework, mapping dependencies across collaboration, failure attribution, and autonomous self-evolution while identifying cross-stage challenges.
BioMedArena supplies a standardized open toolkit with 166 biomedical benchmarks, 75 tools, 6 harnesses, and 6 context strategies that improve 12 backbones and surpass prior SOTA by 15.01 points on average across 8 benchmarks.
HealthAgentBench is a new benchmark of 54 healthcare agent tasks where even the strongest frontier AI agent reaches only about 42% success rate on end-to-end clinical workflows.
MESA ranks MAS communication edges by vulnerability via graph-theoretic metrics and dynamic probes, achieving mean Spearman ρ=+0.60 correlation with empirical per-edge attack success and 3x interception gain when monitoring the top 10%.
MedGuards introduces a multi-agent in-context learning framework for medical error detection and correction plus the KPCS metric, reporting improvements on four multilingual clinical note datasets.
LLMs drop from 71.1% to 38.0% accuracy on medical questions when misleading context is injected, as measured on the new MedMisBench benchmark of 10,932 items.
A survey of RLM use in 28 disciplines reveals uneven adoption and introduces a maturity assessment framework showing larger gaps when limited to public resources.
EHRBench uses an EHR-LLM-KB pipeline to automatically create 960,067 reliable QA items spanning diagnosis, treatment, and prognosis for large-scale LLM evaluation in clinical decision making.
PathNavigate introduces a scan-search-readout routine with surprise-guided low-mag scanning and shared slide memory to improve training-free WSI-VQA accuracy and efficiency.
Proposes a three-step benchmark design method (define work activity, specify tested setting, score work product) derived from work studies and O*NET, demonstrated via three case analyses.
DITTO uses RL with verbal feedback to train LLMs for human behavior simulation, reporting 36% average gains over base models and outperforming GPT-5.4 on 6 of 10 SOUL benchmark tasks.
ClinSeekAgent automates active multimodal evidence seeking for clinical reasoning, improving LLM performance on raw EHR and CXR tasks while enabling distillation into smaller models.
CHI-Bench shows current AI agents achieve at most 28% success on long-horizon healthcare workflows that require dense policy adherence, multi-role handoffs, and multi-turn interactions.
EndoGov uses specialist agents plus a governance layer with hard and soft rule paths to deliver guideline-compliant endometrial cancer risk stratification, reporting 0.943 accuracy and 0.93% logic-violation rate on TCGA-UCEC while outperforming neural baselines on CPTAC-UCEC.
MedDialBench shows LLMs suffer 1.7-3.4x larger diagnostic accuracy drops from patients fabricating symptoms than withholding them, with fabrication driving super-additive interaction effects across models.
MedCheck is a lifecycle checklist framework that audits 53 existing medical LLM benchmarks and identifies systemic gaps in clinical fidelity, contamination control, and safety metrics.
RDMA equips small LLMs with abbreviation resolution, phenotype reasoning, and ontology tools to mine rare diseases from EHR notes, outperforming fine-tuned and RAG baselines at up to 10x lower inference cost.
citing papers explorer
- DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents
  DeepTumorVQA is a new stage-wise 3D CT VQA benchmark showing that quantitative measurement is the main failure point for current medical VLMs and that tool augmentation substantially improves later reasoning stages.
- Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models
  Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.
- Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems
  A survey that unifies prior work on multi-agent LLM systems via the LIFE framework, mapping dependencies across collaboration, failure attribution, and autonomous self-evolution while identifying cross-stage challenges.
- EndoGov: A knowledge-governed multi-agent expert system for endometrial cancer risk stratification
  EndoGov uses specialist agents plus a governance layer with hard and soft rule paths to deliver guideline-compliant endometrial cancer risk stratification, reporting 0.943 accuracy and 0.93% logic-violation rate on TCGA-UCEC while outperforming neural baselines on CPTAC-UCEC.
- AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery
  A survey organizing AI-powered research automation into five workflow stages, defining AutoResearch and Vibe Research, and proposing five evaluation dimensions while noting domain-conditioned limits on autonomy.
- From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review
  A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.