AgentClinic: A Multimodal Agent Benchmark to Evaluate AI in Simulated Clinical Environments
9 Pith papers cite this work.
Citing papers explorer
- RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation
  RealICU is a new benchmark using physician hindsight labels on MIMIC-IV ICU data that exposes LLM failures in long-horizon clinical assessment, acute problem detection, action recommendation, and red-flag identification.
- DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents
  DeepTumorVQA is a new stage-wise 3D CT VQA benchmark showing that quantitative measurement is the main failure point for current medical VLMs and that tool augmentation substantially improves later reasoning stages.
- PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments
  PhysicianBench is a new benchmark of 100 physician-reviewed, execution-grounded tasks in live EHR environments where the best LLM agent reaches only 46% success and open-source models reach 19%.
- Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models
  Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.
- BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents
  BioMedArena releases a standardized toolkit with 147 biomedical benchmarks, 75 tools, and six harnesses that achieve SOTA results on eight tasks with a +15.03 percentage point average lift.
- EndoGov: A knowledge-governed multi-agent expert system for endometrial cancer risk stratification
  EndoGov uses specialist agents plus a governance layer with hard and soft rule paths to deliver guideline-compliant endometrial cancer risk stratification, reporting 0.943 accuracy and a 0.93% logic-violation rate on TCGA-UCEC while outperforming neural baselines on CPTAC-UCEC.
- MedDialBench: Benchmarking LLM Diagnostic Robustness under Parametric Adversarial Patient Behaviors
  MedDialBench shows LLMs suffer 1.7-3.4x larger diagnostic accuracy drops from patients fabricating symptoms than withholding them, with fabrication driving super-additive interaction effects across models.
- Evo-MedAgent: Beyond One-Shot Diagnosis with Agents That Remember, Reflect, and Improve
  Evo-MedAgent adds three evolving memory stores to LLM agents for chest X-ray diagnosis, raising MCQ accuracy from 0.68 to 0.79 on GPT-5-mini and 0.76 to 0.87 on Gemini-3 Flash without any training.
- Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models
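The MedDialBench entry above refers to "super-additive interaction effects": the accuracy drop when both adversarial behaviors co-occur exceeds the sum of the drops from each behavior alone. A minimal sketch of that computation, using entirely hypothetical accuracy numbers (none of these values come from the paper):

```python
# Hypothetical illustration of a super-additive interaction effect.
# All accuracy values below are invented for illustration only.

def drop(baseline: float, perturbed: float) -> float:
    """Accuracy drop caused by a perturbation."""
    return baseline - perturbed

baseline = 0.80        # hypothetical accuracy on clean dialogues
acc_withhold = 0.74    # hypothetical: patient withholds symptoms
acc_fabricate = 0.66   # hypothetical: patient fabricates symptoms
acc_both = 0.52        # hypothetical: both behaviors combined

d_w = drop(baseline, acc_withhold)     # individual drop, withholding
d_f = drop(baseline, acc_fabricate)    # individual drop, fabrication
d_both = drop(baseline, acc_both)      # combined drop

# Interaction effect: combined drop minus the sum of individual drops.
# A positive value means the behaviors compound super-additively.
interaction = d_both - (d_w + d_f)

print(f"fabrication/withholding drop ratio: {d_f / d_w:.2f}x")
print(f"interaction effect: {interaction:+.2f}")
```

With these invented numbers, fabrication costs about 2.3x more accuracy than withholding and the interaction term is positive, matching the qualitative pattern the summary describes.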