MedicalAgentsBench for Complex Medical Reasoning: Comparing Internalized Reasoning Models versus Externalized Agent-based Frameworks

Arman Cohan; Chenglin Wu; Fang Wu; Jiapeng Chen; Jiayi Zhang; Jinyu Xiang; Jiwoong Sohn; Mark Gerstein; Wenqi Shi; Xiangru Tang

arxiv: 2503.07459 · v3 · pith:TAXHFACXnew · submitted 2025-03-10 · 💻 cs.CL · cs.AI

Yanjun Shao, Xiangru Tang, Jiwoong Sohn, Jiapeng Chen, Yuxuan Liao, Jiayi Zhang, Jinyu Xiang, Fang Wu, Yilun Zhao, Chenglin Wu, Wenqi Shi, Arman Cohan, Mark Gerstein

keywords: reasoning, internalized, models, externalized, complex, medical, medicalagentsbench, agent

Complex medical reasoning requires integrating heterogeneous clinical evidence across multiple inference steps. Large language models (LLMs) now approach this through two routes: internalized reasoning and externalized agent scaffolding (frameworks that decompose problems collaboratively amongst multiple LLMs). To determine whether these routes are exclusive or complementary, we introduce MedicalAgentsBench, a filtered benchmark of 862 complex clinical questions drawn from the union of eight medical datasets via difficulty-aware curation and contamination screening. Evaluating three internalized reasoning models (DeepSeek-R1, o1-mini, and o3-mini), seven base models, and nine externalized agent-based methods, we find that internalized and externalized approaches each independently improve performance, and that their benefits compound: the highest accuracy is achieved by layering agent workflows onto an internalized reasoning model (i.e., o3-mini + MDAgents with 35.1%). Pareto analysis shows this combination dominates the cost-performance frontier; moreover, lightweight optimization on inexpensive models offers an entry point for resource-constrained settings. Our benchmark is at https://github.com/gersteinlab/MedicalAgentsBench.
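
The Pareto analysis mentioned in the abstract can be illustrated with a minimal sketch. It assumes each evaluated configuration is summarized by a cost and an accuracy, and it keeps only the configurations that no other configuration beats on both axes. The `Run` type, the function name, and every number below are hypothetical placeholders chosen for illustration; they are not results from the paper and this is not the authors' analysis code.

```python
from typing import NamedTuple


class Run(NamedTuple):
    name: str        # hypothetical label for a model/method configuration
    cost: float      # hypothetical cost per question (lower is better)
    accuracy: float  # hypothetical accuracy in percent (higher is better)


def pareto_frontier(runs: list[Run]) -> list[Run]:
    """Return the runs that are not dominated by any other run.

    A run is dominated when another run is at least as cheap and at least
    as accurate, and strictly better on one of the two axes. Exact
    duplicates are collapsed to a single entry.
    """
    frontier = []
    best_accuracy = float("-inf")
    # Visit runs from cheapest to most expensive; on equal cost, visit the
    # more accurate run first so it shadows its equal-cost competitors.
    for run in sorted(runs, key=lambda r: (r.cost, -r.accuracy)):
        # Keep a run only if it beats every cheaper-or-equal-cost run.
        if run.accuracy > best_accuracy:
            frontier.append(run)
            best_accuracy = run.accuracy
    return frontier


# Placeholder data only: these values are invented for the example.
runs = [
    Run("A", 1.0, 20.0),
    Run("B", 2.0, 18.0),  # dominated by A (cheaper and more accurate)
    Run("C", 3.0, 30.0),
    Run("D", 3.0, 25.0),  # dominated by C (same cost, more accurate)
    Run("E", 5.0, 30.0),  # dominated by C (cheaper, same accuracy)
]

frontier = pareto_frontier(runs)
print([run.name for run in frontier])
```

With real measurements in place of the placeholders, a configuration that "dominates the frontier" in the abstract's sense would be one that appears in this returned list at the high-accuracy end.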

Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks
cs.CV 2025-09 unverdicted novelty 7.0

Neural-MedBench reveals sharp performance drops in state-of-the-art VLMs on reasoning-intensive neurology tasks compared to conventional classification benchmarks, with reasoning failures dominating errors.
MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents
cs.CV 2026-05 conditional novelty 6.0

MementoGUI introduces a modular memory-control framework with working and episodic memory operators that improves long-horizon GUI agent performance over history-replay and text-only baselines.
CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?
cs.CL 2026-05 unverdicted novelty 6.0

CHI-Bench shows current AI agents achieve at most 28% success on long-horizon healthcare workflows that require dense policy adherence, multi-role handoffs, and multi-turn interactions.
Medmarks: A Comprehensive Open-Source LLM Benchmark Suite for Medical Tasks
cs.CL 2026-05 unverdicted novelty 6.0

Medmarks introduces 30 open benchmarks for medical LLM tasks and evaluates 61 models, finding frontier reasoning models lead while medically fine-tuned ones outperform generalists and all show answer-order bias.
MultiDx: A Multi-Source Knowledge Integration Framework towards Diagnostic Reasoning
cs.CL 2026-04 unverdicted novelty 5.0

MultiDx integrates evidence from web search, SOAP cases, and clinical databases in a two-stage process to improve LLM diagnostic reasoning and alignment with clinical trajectories.
From Exposure to Internalization: Dual-Stream Calibration for In-context Clinical Reasoning
q-bio.QM 2026-04 unverdicted novelty 5.0

Dual-Stream Calibration uses entropy minimization and iterative meta-learning at test time to internalize clinical evidence and outperform standard in-context learning baselines on medical tasks.
MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs
cs.CL 2026-02 unverdicted novelty 4.0

MedXIAOHE is a medical MLLM that claims state-of-the-art benchmark performance through specialized pretraining to cover long-tail diseases and RL-based reasoning training.
From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review
cs.AI 2025-04 accept novelty 4.0

A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.
From System 1 to System 2: A Survey of Reasoning Large Language Models
cs.AI 2025-02 accept novelty 3.0

The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
cs.CV 2025-03 unverdicted novelty 2.0

The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.