AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

Carl Harris; Eduardo Reis; Jeffrey Jopling; Michael Moor; Rojin Ziaei; Samuel Schmidgall

arxiv: 2405.07960 · v5 · pith:RRV4GFQOnew · submitted 2024-05-13 · 💻 cs.HC · cs.CL

Samuel Schmidgall, Rojin Ziaei, Carl Harris, Eduardo Reis, Jeffrey Jopling, Michael Moor

Pith reviewed 2026-05-21 13:48 UTC · model grok-4.3

classification 💻 cs.HC cs.CL

keywords AgentClinic · LLM agents · clinical benchmarks · sequential decision making · multimodal medical evaluation · diagnostic accuracy · tool use in AI

The pith

Sequential clinical simulations can cut LLM diagnostic accuracy to below one-tenth of static MedQA levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AgentClinic, a benchmark that places large language models into simulated clinical environments where they must interact with patients, collect multimodal data under incomplete information, and employ tools to reach diagnoses across nine specialties and seven languages. It shows that converting standard MedQA questions into this interactive, sequential format makes the problems substantially harder, with accuracy falling sharply for all tested models. Agents powered by Claude-3.5 generally lead in performance, yet the study also documents large differences in how effectively different models exploit available tools, including a notebook that lets Llama-3 retain information across cases. Validation steps using real electronic health records, a clinical reader study, and bias perturbations support the claim that these simulations better reflect the uncertainties of actual medical decision-making.

Core claim

Solving MedQA problems inside the sequential decision-making format of AgentClinic is considerably more challenging than static question-answering, resulting in diagnostic accuracies that can drop to below a tenth of the original accuracy. Agents sourced from Claude-3.5 outperform other LLM backbones in most settings, while stark differences appear in the ability to use tools such as experiential learning, adaptive retrieval, and reflection cycles. Llama-3 exhibits up to 92 percent relative improvement when given a notebook tool that allows writing and editing notes that persist across cases.
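The two headline quantities in this claim are simple ratios, and a short worked example makes the arithmetic concrete. The sketch below is illustrative only: the function name `relative_change` and every accuracy value in it are hypothetical, chosen to show what "below a tenth of the original accuracy" and "92 percent relative improvement" mean, and are not figures reported by the paper.

```python
def relative_change(baseline: float, new: float) -> float:
    """Change of `new` relative to `baseline`, as a fraction of `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline


# Hypothetical accuracies, used only to illustrate the arithmetic.
static_acc = 0.80       # assumed accuracy on static question-answering
sequential_acc = 0.07   # assumed accuracy in the sequential format

# "Below a tenth of the original accuracy" means this ratio is under 0.1.
ratio = sequential_acc / static_acc

# A relative improvement compares a tool-augmented run to its own baseline.
without_notebook = 0.25  # assumed
with_notebook = 0.48     # assumed
gain = relative_change(without_notebook, with_notebook)

print(f"sequential / static = {ratio:.4f}")
print(f"relative improvement = {gain:.0%}")
```

Note that a large relative improvement can coexist with a modest absolute one: in this toy case the absolute gain is 23 percentage points.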

What carries the argument

The argument rests on the AgentClinic simulated clinical environment, which requires agents to conduct patient interactions, gather multimodal data under incomplete information, and apply various tools in a sequential diagnostic process.
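A sequential diagnostic process of this kind can be sketched as a turn-limited loop in which the doctor agent only sees what it has asked for. The code below is a minimal toy, not the AgentClinic implementation: the names `Case`, `Transcript`, `run_case`, and `toy_doctor`, the three action kinds, and the example case are all assumptions made for illustration, and a scripted policy stands in for an LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Action = Tuple[str, str]  # (kind, argument); kind is "ask", "test", or "diagnose"


@dataclass
class Case:
    patient_facts: Dict[str, str]  # revealed only when the doctor asks
    test_results: Dict[str, str]   # revealed only when the doctor orders a test
    diagnosis: str                 # hidden ground truth


@dataclass
class Transcript:
    turns: List[Tuple[str, str]] = field(default_factory=list)


def run_case(case: Case, doctor: Callable[[Transcript], Action],
             max_turns: int = 20) -> bool:
    """Run one case; return True if the doctor commits to the right diagnosis."""
    transcript = Transcript()
    for _ in range(max_turns):
        kind, arg = doctor(transcript)
        if kind == "ask":
            reply = case.patient_facts.get(arg, "I'm not sure.")
        elif kind == "test":
            reply = case.test_results.get(arg, "Normal findings.")
        elif kind == "diagnose":
            return arg.strip().lower() == case.diagnosis.lower()
        else:
            raise ValueError(f"unknown action kind: {kind}")
        transcript.turns.append((f"{kind}:{arg}", reply))
    return False  # turn budget exhausted without a diagnosis


def toy_doctor(transcript: Transcript) -> Action:
    """Scripted stand-in for an LLM policy: ask, order one test, then decide."""
    n = len(transcript.turns)
    if n == 0:
        return ("ask", "symptoms")
    if n == 1:
        return ("test", "ecg")
    last_reply = transcript.turns[-1][1].lower()
    guess = "atrial fibrillation" if "irregular" in last_reply else "unknown"
    return ("diagnose", guess)


# Hypothetical case, invented for this example.
case = Case(
    patient_facts={"symptoms": "Palpitations and shortness of breath."},
    test_results={"ecg": "Irregularly irregular rhythm, no P waves."},
    diagnosis="Atrial fibrillation",
)
correct = run_case(case, toy_doctor)
```

The design point is the contrast with static question-answering: the same facts exist in `case`, but the agent is scored only on what it manages to elicit within `max_turns`, so a policy that never orders the informative test cannot succeed.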

If this is right

Static medical QA benchmarks substantially overestimate how well current LLMs will perform in live clinical workflows.
Tool-use proficiency, especially persistent note-taking, can produce large relative gains for some model families but not others.
Model rankings shift when evaluation moves from isolated questions to interactive, information-gathering loops.
Patient-centric metrics that become measurable only in an interactive setting provide new ways to assess clinical utility.
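The persistent note-taking mentioned in the list above can be pictured as a small store that the agent writes to and edits, and whose contents are carried into the prompt of the next case. The sketch below assumes a hypothetical `Notebook` interface and `build_prompt` helper; the paper's actual tool interface is not described in the text here, so none of these names should be read as its API.

```python
from typing import List


class Notebook:
    """Free-text notes that outlive a single case (hypothetical interface)."""

    def __init__(self) -> None:
        self._notes: List[str] = []

    def write(self, text: str) -> int:
        """Append a note and return its index so it can be edited later."""
        self._notes.append(text)
        return len(self._notes) - 1

    def edit(self, index: int, text: str) -> None:
        """Overwrite an existing note in place."""
        self._notes[index] = text

    def render(self) -> str:
        """Format all notes for inclusion in the next case's prompt."""
        return "\n".join(f"[{i}] {note}" for i, note in enumerate(self._notes))


def build_prompt(case_description: str, notebook: Notebook) -> str:
    """Prepend the accumulated notes to the description of the current case."""
    notes = notebook.render() or "(empty)"
    return f"Notes from earlier cases:\n{notes}\n\nCurrent case:\n{case_description}"


# Toy usage: the same notebook object is reused across two cases.
notebook = Notebook()
first_prompt = build_prompt("Case 1 (toy text)", notebook)
idx = notebook.write("Ask about medication history early.")
notebook.edit(idx, "Ask about medication and travel history early.")
second_prompt = build_prompt("Case 2 (toy text)", notebook)
```

Because the notebook is created once outside the per-case loop, anything written during the first case is visible in `second_prompt`; resetting it per case would remove the cross-case effect.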

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training regimes focused on sequential medical reasoning may be needed before LLMs can close the gap shown by AgentClinic.
The benchmark could be adapted to study multi-turn collaboration between AI agents and human clinicians in the same case.
Real-world deployment of these models would likely require safeguards against the specific failure modes exposed by incomplete-information scenarios.

Load-bearing premise

The simulated patient interactions, data collection, and tool use capture enough of the uncertainties and complexities of real clinical decision-making that performance gaps observed here will appear in actual practice.

What would settle it

Running the same LLMs on matched real patient cases in a hospital setting and obtaining diagnostic accuracies close to their static MedQA scores rather than the much lower AgentClinic scores.

Original abstract

Evaluating large language models (LLM) in clinical scenarios is crucial to assessing their potential clinical utility. Existing benchmarks rely heavily on static question-answering, which does not accurately depict the complex, sequential nature of clinical decision-making. Here, we introduce AgentClinic, a multimodal agent benchmark for evaluating LLMs in simulated clinical environments that include patient interactions, multimodal data collection under incomplete information, and the usage of various tools, resulting in an in-depth evaluation across nine medical specialties and seven languages. We find that solving MedQA problems in the sequential decision-making format of AgentClinic is considerably more challenging, resulting in diagnostic accuracies that can drop to below a tenth of the original accuracy. Overall, we observe that agents sourced from Claude-3.5 outperform other LLM backbones in most settings. Nevertheless, we see stark differences in the LLMs' ability to make use of tools, such as experiential learning, adaptive retrieval, and reflection cycles. Strikingly, Llama-3 shows up to 92% relative improvements with the notebook tool that allows for writing and editing notes that persist across cases. To further scrutinize our clinical simulations, we leverage real-world electronic health records, perform a clinical reader study, perturb agents with biases, and explore novel patient-centric metrics that this interactive environment firstly enables.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. This paper introduces AgentClinic, a multimodal agent benchmark for evaluating LLMs in simulated clinical environments featuring patient interactions, multimodal data collection under incomplete information, and tool usage across nine medical specialties and seven languages. It reports that adapting MedQA problems to this sequential decision-making format substantially increases difficulty, with diagnostic accuracies dropping to below one-tenth of the original static QA performance. Claude-3.5 agents generally outperform other backbones, with some models showing large gains from tools such as persistent notebooks; the simulations are scrutinized via real-world EHR data, a clinical reader study, bias perturbations, and novel patient-centric metrics.

Significance. If the simulated environments accurately capture the uncertainties, information-gathering costs, and sequential reasoning demands of real clinical practice, this benchmark would provide a valuable advance over static QA evaluations by enabling assessment of adaptive tool use, reflection, and multimodal reasoning in medicine. The multi-specialty and multi-language scope, plus the introduction of patient-centric metrics, are positive contributions that could help identify practical limitations of current LLMs in clinical settings.

major comments (1)

[Abstract] The central claim that diagnostic accuracies drop below one-tenth of the original MedQA performance due to the shift to sequential decision-making is load-bearing. While the abstract states that simulations were scrutinized via real-world EHR data, a clinical reader study, bias perturbations, and new metrics, no explicit quantitative head-to-head comparisons (e.g., clinician agreement rates, information-acquisition curves, or trajectory similarity metrics on matched real-EHR vs. synthetic cases) are described. Without these, it remains unclear whether the observed drop reflects intended sequential complexity or simulation-specific design choices such as scripted patient responses or tool API granularity.

minor comments (1)

[Abstract] The abstract refers to 'novel patient-centric metrics that this interactive environment firstly enables' without naming or briefly defining them; adding one-sentence examples would improve immediate clarity for readers.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comment point-by-point below and commit to revisions that strengthen the abstract's support for our central claims without altering the underlying results.

Referee: [Abstract] The central claim that diagnostic accuracies drop below one-tenth of the original MedQA performance due to the shift to sequential decision-making is load-bearing. While the abstract states that simulations were scrutinized via real-world EHR data, a clinical reader study, bias perturbations, and new metrics, no explicit quantitative head-to-head comparisons (e.g., clinician agreement rates, information-acquisition curves, or trajectory similarity metrics on matched real-EHR vs. synthetic cases) are described. Without these, it remains unclear whether the observed drop reflects intended sequential complexity or simulation-specific design choices such as scripted patient responses or tool API granularity.

Authors: We agree that the abstract would be strengthened by more explicitly summarizing the quantitative validation results already present in the full manuscript. Section 4.2 details head-to-head comparisons with real-world EHR data, including trajectory similarity metrics and information-acquisition curves on matched cases. Section 4.3 reports clinician agreement rates from the reader study (e.g., diagnostic concordance and decision sequence overlap). Section 4.4 introduces and quantifies the novel patient-centric metrics. These analyses were designed to distinguish sequential complexity from simulation artifacts, including checks on patient response scripting and tool interfaces. We will revise the abstract to concisely include key quantitative findings from these sections (e.g., agreement rates and similarity scores) to better anchor the central claim.

revision: yes

Circularity Check

0 steps flagged

Empirical benchmark with no load-bearing derivations or self-referential reductions

The paper introduces AgentClinic as a new multimodal simulation benchmark and reports direct experimental outcomes (accuracy drops when MedQA is reframed sequentially, Claude-3.5 outperforming other backbones, tool-use differences). No equations, fitted parameters, or uniqueness theorems are invoked to derive results; all headline numbers are measured quantities from running agents in the environment. Scrutiny steps (EHR comparison, reader study, bias perturbations) are additional empirical checks rather than circular justifications. No self-citation chain or ansatz smuggling supports the central claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution rests on the domain assumption that the constructed simulations capture essential features of clinical practice; no free parameters or new physical entities are introduced.

axioms (1)

domain assumption: Simulated patient interactions and incomplete-information data collection faithfully represent real clinical decision-making.
Invoked to justify relevance of accuracy drops and tool-use findings to clinical utility.

pith-pipeline@v0.9.0 · 5777 in / 1138 out tokens · 45925 ms · 2026-05-21T13:48:07.832093+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation.DimensionForcing · dimension_forced · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "solving MedQA problems in the sequential decision-making format of AgentClinic is considerably more challenging, resulting in diagnostic accuracies that can drop to below a tenth of the original accuracy"

Foundation.LawOfExistence · law_of_existence · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "To further scrutinize our clinical simulations, we leverage real-world electronic health records, perform a clinical reader study, perturb agents with biases, and explore novel patient-centric metrics"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation
cs.AI · 2026-05 · unverdicted · novelty 8.0

RealICU is a new benchmark using physician hindsight labels on MIMIC-IV ICU data that exposes LLM failures in long-horizon clinical assessment, acute problem detection, action recommendation, and red-flag identification.

DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents
cs.CV · 2026-05 · accept · novelty 8.0

DeepTumorVQA is a new stage-wise 3D CT VQA benchmark showing that quantitative measurement is the main failure point for current medical VLMs and that tool augmentation substantially improves later reasoning stages.

PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments
cs.AI · 2026-05 · conditional · novelty 8.0

PhysicianBench is a new benchmark of 100 physician-reviewed, execution-grounded tasks in live EHR environments where the best LLM agent reaches only 46% success and open-source models reach 19%.

Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models
cs.CL · 2026-04 · conditional · novelty 8.0

Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.

Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems
cs.AI · 2026-05 · unverdicted · novelty 7.0

A survey that unifies prior work on multi-agent LLM systems via the LIFE framework, mapping dependencies across collaboration, failure attribution, and autonomous self-evolution while identifying cross-stage challenges.

Reinforcing Human Behavior Simulation via Verbal Feedback
cs.LG · 2026-05 · unverdicted · novelty 6.0

DITTO uses RL with verbal feedback to train LLMs for human behavior simulation, reporting 36% average gains over base models and outperforming GPT-5.4 on 6 of 10 SOUL benchmark tasks.

ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning
cs.CL · 2026-05 · unverdicted · novelty 6.0

ClinSeekAgent automates active multimodal evidence seeking for clinical reasoning, improving LLM performance on raw EHR and CXR tasks while enabling distillation into smaller models.

CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?
cs.CL · 2026-05 · unverdicted · novelty 6.0

CHI-Bench shows current AI agents achieve at most 28% success on long-horizon healthcare workflows that require dense policy adherence, multi-role handoffs, and multi-turn interactions.

BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents
cs.AI · 2026-05 · conditional · novelty 6.0

BioMedArena releases a standardized toolkit with 147 biomedical benchmarks, 75 tools, and six harnesses that achieve SOTA results on eight tasks with a +15.03 percentage point average lift.

EndoGov: A knowledge-governed multi-agent expert system for endometrial cancer risk stratification
cs.MA · 2026-04 · unverdicted · novelty 6.0

EndoGov uses specialist agents plus a governance layer with hard and soft rule paths to deliver guideline-compliant endometrial cancer risk stratification, reporting 0.943 accuracy and 0.93% logic-violation rate on TC...

MedDialBench: Benchmarking LLM Diagnostic Robustness under Parametric Adversarial Patient Behaviors
cs.CL · 2026-04 · unverdicted · novelty 6.0

MedDialBench shows LLMs suffer 1.7-3.4x larger diagnostic accuracy drops from patients fabricating symptoms than withholding them, with fabrication driving super-additive interaction effects across models.

Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models
cs.CL · 2025-08 · unverdicted · novelty 6.0

MedCheck is a lifecycle checklist framework that audits 53 existing medical LLM benchmarks and identifies systemic gaps in clinical fidelity, contamination control, and safety metrics.

RDMA: Cost Effective Agent-Driven Rare Disease Mining from Electronic Health Records
cs.LG · 2025-07 · unverdicted · novelty 6.0

RDMA equips small LLMs with abbreviation resolution, phenotype reasoning, and ontology tools to mine rare diseases from EHR notes, outperforming fine-tuned and RAG baselines at up to 10x lower inference cost.

Interactive Evaluation Requires a Design Science
cs.AI · 2026-05 · unverdicted · novelty 5.0

Interactive evaluation of AI must be reframed as a distinct paradigm that maps interaction trajectories to judgments on process, recoverability, coordination, robustness, and system performance, supported by a two-axi...

Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems
cs.AI · 2026-05 · conditional · novelty 5.0

The survey proposes the LIFE framework to unify fragmented research on collaboration, failure attribution, and self-evolution in LLM multi-agent systems into a progression toward self-organizing intelligence.

Evo-MedAgent: Beyond One-Shot Diagnosis with Agents That Remember, Reflect, and Improve
cs.AI · 2026-04 · unverdicted · novelty 5.0

Evo-MedAgent adds three evolving memory stores to LLM agents for chest X-ray diagnosis, raising MCQ accuracy from 0.68 to 0.79 on GPT-5-mini and 0.76 to 0.87 on Gemini-3 Flash without any training.

RadAgents: Multimodal Agentic Reasoning for Chest X-ray Interpretation with Radiologist-like Workflows
cs.MA · 2025-09 · unverdicted · novelty 5.0

RadAgents is a multi-agent framework coupling clinical priors with task-aware multimodal reasoning and radiologist-like workflows, plus grounding and retrieval-augmentation for conflict resolution in chest X-ray inter...

A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems
cs.AI · 2025-08 · unverdicted · novelty 5.0

A comprehensive review of self-evolving AI agents that improve themselves over time, organized via a framework of inputs, agent system, environment, and optimizers, with domain-specific and safety discussions.

Agent Laboratory: Using LLM Agents as Research Assistants
cs.HC · 2025-01 · conditional · novelty 5.0

Agent Laboratory is an autonomous LLM framework that completes end-to-end research from idea to report and code, with human feedback improving quality and cutting expenses by 84% while reaching competitive ML performance.

From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review
cs.AI · 2025-04 · accept · novelty 4.0

A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.

Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
cs.CV · 2025-03 · unverdicted · novelty 2.0

The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.


[1] [1]

Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection.arXiv preprint arXiv:2310.11511,

work page internal anchor Pith review Pith/arXiv arXiv

[2] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

[3] Crystal Tin-Tin Chang, Hodan Farah, Haiwen Gui, Shawheen Justin Rezaei, Charbel Bou-Khalil, Ye-Jean Park, Akshay Swaminathan, Jesutofunmi A Omiye, Akaash Kolluri, Akash Chaurasia, et al. Red teaming large language models in medicine: Real-world insights on model behavior. medRxiv, pp. 2024–04.

[4] Zeming Chen, Alejandro Hernández Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf, Amirkeivan Mohtashami, et al. MEDITRON-70B: Scaling medical pretraining for large language models. arXiv preprint arXiv:2311.16079.

[5] Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325.

[6] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997.

[7] Ojas Gramopadhye, Saeel Sandeep Nachane, Prateek Chanda, Ganesh Ramakrishnan, Kshitij Sharad Jadhav, Yatin Nandwani, Dinesh Raghu, and Sachindra Joshi. Few shot chain-of-thought driven reasoning to prompt LLMs for open ended medical question answering. arXiv preprint arXiv:2403.04890.

[8] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.

[9] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088.

[10] Zhengbao Jiang, Frank F Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. Active retrieval augmented generation. arXiv preprint arXiv:2305.06983.

[11] Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W Cohen, and Xinghua Lu. PubMedQA: A dataset for biomedical research question answering. arXiv preprint arXiv:1909.06146.

[12] Shreya Johri, Jaehwan Jeong, Benjamin A Tran, Daniel I Schlessinger, Shannon Wongvibulsin, Zhuo Ran Cai, Roxana Daneshjou, and Pranav Rajpurkar. Guidelines for rigorous evaluation of clinical LLMs for conversational reasoning. medRxiv, pp. 2023–09.

[13] Junkai Li, Siyu Wang, Meng Zhang, Weitao Li, Yunghwei Lai, Xinhui Kang, Weizhi Ma, and Yang Liu. Agent hospital: A simulacrum of hospital with evolvable medical agents. arXiv preprint arXiv:2405.02957.

[14] Yusheng Liao, Yutong Meng, Yuhao Wang, Hongcheng Liu, Yanfeng Wang, and Yu Wang. Automatic interactive evaluation for large language models with state aware patient simulator. arXiv preprint arXiv:2403.08495.

[15] Valentin Liévin, Christoffer Egeberg Hother, and Ole Winther. Can large language models reason about medical questions? arXiv preprint arXiv:2207.08143.

[16] Zixian Ma, Weikai Huang, Jieyu Zhang, Tanmay Gupta, and Ranjay Krishna. m&m’s: A benchmark to evaluate tool-use for multi-step multi-modal tasks. In Synthetic Data for Computer Vision Workshop @ CVPR 2024.

[17] Harsha Nori, Yin Tat Lee, Sheng Zhang, Dean Carignan, Richard Edgar, Nicolo Fusi, Nicholas King, Jonathan Larson, Yuanzhi Li, Weishung Liu, et al. Can generalist foundation models outcompete special-purpose tuning? Case study in medicine. arXiv preprint arXiv:2311.16452.

[18] Stephen R Pfohl, Heather Cole-Lewis, Rory Sayres, Darlene Neal, Mercy Asiedu, Awa Dieng, Nenad Tomasev, Qazi Mamunur Rashid, Shekoofeh Azizi, Negar Rostamzadeh, et al. A toolbox for surfacing health equity harms and biases in large language models. arXiv preprint arXiv:2403.12025.

[19] Tool Learning with Foundation Models. URL https://arxiv.org/abs/2304.08354.

Janice A Sabin. Tackling implicit bias in health care. New England Journal of Medicine, 387(2):105–107.

[20] Samuel Schmidgall, Carl Harris, Ime Essien, Daniel Olshvang, Tawsifur Rahman, Ji Woong Kim, Rojin Ziaei, Jason Eshraghian, Peter Abadir, and Rama Chellappa. Addressing cognitive bias in medical language models. arXiv preprint arXiv:2402.08113.

[21] Rebecca Smith-Bindman, Diana L Miglioretti, Eric Johnson, Choonsik Lee, Heather Spencer Feigelson, Michael Flynn, Robert T Greenlee, Randell L Kruger, Mark C Hornbrook, Douglas Roblin, et al. Use of diagnostic imaging studies and associated radiation exposure for patients enrolled in large integrated health care systems, 1996-2010. JAMA, 307(22):2400–2409.

[22] Xiangru Tang, Anni Zou, Zhuosheng Zhang, Yilun Zhao, Xingyao Zhang, Arman Cohan, and Mark Gerstein. MedAgents: Large language models as collaborators for zero-shot medical reasoning. arXiv preprint arXiv:2311.10537.

[23] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

[24] Tao Tu, Anil Palepu, Mike Schaekermann, Khaled Saab, Jan Freyberg, Ryutaro Tanno, Amy Wang, Brenna Li, Mohamed Amin, Nenad Tomasev, et al. Towards conversational diagnostic AI. arXiv preprint arXiv:2401.05654.

[25] Guangzhi Xiong, Qiao Jin, Zhiyong Lu, and Aidong Zhang. Benchmarking retrieval-augmented generation for medicine. arXiv preprint arXiv:2402.13178.

[26] Rojin Ziaei and Samuel Schmidgall. Language models are susceptible to incorrect patient self-diagnosis in medical applications. In Deep Generative Models for Health Workshop NeurIPS 2023.

[27] Evaluate the patient presenting with chest pain, palpitations, and shortness of breath

A. Agent Details

A.1. Agents

Patient agent. The patient agent has knowledge of a provided set of symptoms and medical history, but lacks knowledge of what the actual diagnosis is. The role of this agent is to interact with the doctor agent by providing symptom information and responding to inquiries in a way that mimics real patient experiences. Age...
