MedBench v5: A Dynamic, Process-Oriented, and Hallucination-Aware Benchmark for Clinical Multimodal Models

Jinru Ding, Chuchu Jiang, Lu Lu, Wenrao Pang, Mouxiao Bian, Zhuangzhi Gao, Jiangyuan Chen, Xinwei Peng, Ruiyao Chen, Sijie Ren, Renjie Lu, Yun Zhong, Bin Han, Meiling Liu, Jie Xu

arxiv: 2606.24155 · v3 · pith:TDBCV6ZHnew · submitted 2026-06-23 · 💻 cs.CL

Pith reviewed 2026-06-26 05:40 UTC · model grok-4.3

classification 💻 cs.CL

keywords clinical multimodal models · process-oriented benchmark · hallucination propagation · information-flow stressors · dynamic evaluation · medical AI · contradiction detection · failure fingerprints

The pith

MedBench v5 shows that high aggregate scores on clinical multimodal tasks do not ensure stable reasoning when information is omitted, contradicted, or delayed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

MedBench v5 replaces static question-answering tests with a dynamic benchmark that tracks how clinical multimodal models handle changing information flows across 63 tasks. The benchmark combines fourteen cognitive-responsiveness dimensions with four agent skill environments and applies three controllable stressors—omission, contradiction, and evidence delay—to isolate where performance breaks. A five-node process audit records each model’s trajectory through contradiction detection, diagnosis updating, hallucination initiation and propagation, and self-correction. Experiments on frontier models demonstrate that strong final-answer accuracy can coexist with clear instability in these intermediate steps, while evidence grounding at the end often remains superficially intact. The resulting model-specific failure fingerprints supply a finer-grained alternative to overall accuracy metrics.
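The three stressors can be pictured as transformations of the evidence stream a model sees over the course of a case. The sketch below is illustrative only: the `Evidence` record, the function names, and the turn-based representation are assumptions made for this example, not the benchmark's actual data format or API.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Evidence:
    """Hypothetical evidence item; field names are assumptions."""
    turn: int    # turn at which the item is revealed to the model
    key: str     # e.g. "troponin"
    value: str   # e.g. "elevated"


def omit(stream: List[Evidence], key: str) -> List[Evidence]:
    """Omission: drop every item carrying the given key."""
    return [e for e in stream if e.key != key]


def contradict(stream: List[Evidence], key: str, new_value: str,
               at_turn: int) -> List[Evidence]:
    """Contradiction: add a later item that conflicts with an earlier one."""
    return sorted(stream + [Evidence(at_turn, key, new_value)],
                  key=lambda e: e.turn)


def delay(stream: List[Evidence], key: str, by: int) -> List[Evidence]:
    """Evidence delay: reveal the item `by` turns later than scheduled."""
    shifted = [replace(e, turn=e.turn + by) if e.key == key else e
               for e in stream]
    return sorted(shifted, key=lambda e: e.turn)


def apply_stressors(stream: List[Evidence], *,
                    omission: Optional[str] = None,
                    contradiction: Optional[Tuple[str, str, int]] = None,
                    delay_spec: Optional[Tuple[str, int]] = None) -> List[Evidence]:
    """Each stressor is independently switchable; None leaves it off."""
    out = list(stream)
    if omission is not None:
        out = omit(out, omission)
    if contradiction is not None:
        out = contradict(out, *contradiction)
    if delay_spec is not None:
        out = delay(out, *delay_spec)
    return out


case = [
    Evidence(1, "chest_pain", "present"),
    Evidence(2, "troponin", "elevated"),
    Evidence(3, "ecg", "st_elevation"),
]
stressed = apply_stressors(
    case,
    omission="ecg",
    contradiction=("troponin", "normal", 4),
    delay_spec=("chest_pain", 2),
)
```

With all switches left at `None` the stream is returned unchanged, which gives the unstressed baseline that a stressed run would be compared against.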

Core claim

MedBench v5 establishes a dual-dimensional evaluation framework that pairs Clinical Cognitive Responsiveness with Medical Atomic Skills, applies switchable information-flow stressors, and audits five reasoning nodes to produce failure fingerprints; experiments reveal that stressors primarily impair contradiction detection, diagnosis updating, hallucination propagation, and contradiction-based self-correction even when final evidence grounding appears stable.
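Because the stressors are switchable, they can in principle be crossed into a full on/off design. The following is a minimal sketch of that idea, assuming a 2x2x2 factorial layout and a single scalar process score per condition; the page does not say which design or estimator the authors use, so `conditions`, `main_effects`, and the toy scores are all hypothetical.

```python
from itertools import product
from statistics import mean
from typing import Dict, List, Tuple

# Stressor names follow the page; the ordering is an arbitrary choice here.
STRESSORS = ("omission", "contradiction", "delay")

Flags = Tuple[bool, bool, bool]


def conditions() -> List[Dict[str, bool]]:
    """Enumerate every on/off combination of the three stressors."""
    return [dict(zip(STRESSORS, flags))
            for flags in product((False, True), repeat=len(STRESSORS))]


def main_effects(scores: Dict[Flags, float]) -> Dict[str, float]:
    """Average score drop attributable to switching each stressor on.

    `scores` maps an (omission, contradiction, delay) flag tuple to a
    process score in [0, 1]; higher is better.
    """
    effects = {}
    for i, name in enumerate(STRESSORS):
        on = mean(v for k, v in scores.items() if k[i])
        off = mean(v for k, v in scores.items() if not k[i])
        effects[name] = off - on
    return effects


# Toy scores constructed to be additive, purely for illustration.
toy_scores = {
    flags: 1.0 - 0.1 * flags[0] - 0.3 * flags[1] - 0.2 * flags[2]
    for flags in product((False, True), repeat=3)
}
effects = main_effects(toy_scores)
```

On the toy data the contradiction stressor shows the largest main effect because it was constructed that way; real scores would also need interaction terms checked before reading the main effects in isolation.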

What carries the argument

The five-node dynamic process audit protocol that records trajectories through contradiction detection, diagnosis updating, hallucination propagation, and self-correction under three switchable stressors.
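One way to read "failure fingerprint" is as a per-node vector of how much the pass rate falls when stressors are switched on. The sketch below follows that reading. The five node names are an assumption pieced together from the processes the summary mentions (the page does not enumerate the nodes), and the boolean pass/fail audit record is likewise hypothetical.

```python
from statistics import mean
from typing import Dict, List

# Assumed node names; the source does not list the five nodes explicitly.
NODES = [
    "contradiction_detection",
    "diagnosis_updating",
    "hallucination_propagation",
    "self_correction",
    "evidence_grounding",
]

AuditRecord = Dict[str, bool]  # node name -> did the run pass the audit there


def node_rates(runs: List[AuditRecord]) -> Dict[str, float]:
    """Fraction of runs that pass the audit at each node."""
    return {n: mean(1.0 if r[n] else 0.0 for r in runs) for n in NODES}


def fingerprint(baseline: List[AuditRecord],
                stressed: List[AuditRecord]) -> Dict[str, float]:
    """Per-node drop in pass rate from baseline to stressed runs."""
    b, s = node_rates(baseline), node_rates(stressed)
    return {n: round(b[n] - s[n], 3) for n in NODES}


all_pass = {n: True for n in NODES}
baseline_runs = [dict(all_pass), dict(all_pass)]
stressed_runs = [
    {**all_pass, "contradiction_detection": False, "diagnosis_updating": False},
    dict(all_pass),
]
fp = fingerprint(baseline_runs, stressed_runs)
```

In this toy case the fingerprint is nonzero only at the two intermediate nodes while `evidence_grounding` stays at zero, which mirrors the pattern the page describes: the final grounding looks stable even though earlier steps degrade.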

If this is right

Aggregate task accuracy becomes insufficient for certifying clinical multimodal models.
Hallucination monitoring must track initiation, propagation, anchoring, and contradiction interaction rather than final output alone.
Process-specific stress testing can generate model fingerprints that guide targeted improvements in contradiction handling and diagnosis updating.
Unified infrastructure now exists for capability profiling, controllable degradation analysis, and hallucination trajectory tracing.
Final-answer stability under stress does not imply internal reasoning stability.
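The hallucination stages named above (initiation, propagation, anchoring, contradiction interaction) can be made concrete by tracing a single unsupported claim across turns. This is a rough sketch under assumed definitions: the page names the stages but does not define them, so the operationalization in `trace_claim` and the set-of-claims input format are guesses for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Trajectory:
    """Assumed summary of one unsupported claim's life across a dialogue."""
    initiation_turn: Optional[int]          # first turn asserting the claim
    propagated_turns: int                   # later turns that repeat it
    anchored: bool                          # final answer relies on it
    survived_contradiction: Optional[bool]  # persists after contrary evidence


def trace_claim(claim: str,
                turns: List[Set[str]],
                final_claims: Set[str],
                contradiction_turn: Optional[int] = None) -> Trajectory:
    """Trace `claim` through per-turn claim sets (0-indexed turns)."""
    hits = [i for i, claims in enumerate(turns) if claim in claims]
    if not hits:
        return Trajectory(None, 0, False, None)
    survived = None
    if contradiction_turn is not None:
        # The claim "survives" if it reappears after the contradicting
        # evidence was shown, or is still used in the final answer.
        survived = (any(i > contradiction_turn for i in hits)
                    or claim in final_claims)
    return Trajectory(hits[0], len(hits) - 1, claim in final_claims, survived)


turn_claims = [set(), {"prior MI"}, {"prior MI"}, {"prior MI", "smoker"}]
result = trace_claim("prior MI", turn_claims,
                     final_claims={"STEMI"}, contradiction_turn=2)
```

Here the claim starts at turn 1, is repeated twice, and persists past the contradicting evidence, yet never appears in the final answer. An output-only check would miss it, which is the motivation for tracking the trajectory rather than the final output alone.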

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers could use the failure fingerprints to prioritize training objectives that stabilize intermediate reasoning nodes rather than only final outputs.
The benchmark could be extended to compare degradation patterns across language-only, vision-language, and agent-based clinical systems in head-to-head trials.
Real-world deployment logs could be replayed through the same stressor protocol to test whether laboratory fingerprints predict field failures.
Hospitals might adopt the audit protocol as a recurring stress test before approving model updates for patient-facing use.

Load-bearing premise

The chosen stressors and five-node audit protocol meaningfully capture how clinical reasoning degrades in practice.

What would settle it

A controlled study that applies the same omission, contradiction, and delay stressors to the same models inside actual clinical workflows and checks whether the same process instabilities appear.

Figures

Figures reproduced from arXiv: 2606.24155 by Bin Han, Chuchu Jiang, Jiangyuan Chen, Jie Xu, Jinru Ding, Lu Lu, Meiling Liu, Mouxiao Bian, Renjie Lu, Ruiyao Chen, Sijie Ren, Wenrao Pang, Xinwei Peng, Yun Zhong, Zhuangzhi Gao.

**Figure 1.** DataAgent evaluates clinical data interaction over structured and semi-structured sources, including MySQL, PostgreSQL, CSV files, and unstructured clinical text. Given a user request, the agent performs multi-turn ... (caption truncated in source)

Original abstract

Existing medical AI benchmarks lack process visibility, atomic skill evaluation, and integrated hallucination detection. We introduce MedBench v5, a redesigned benchmark for clinical multimodal models (language, vision-language, and agent systems) that moves from static QA to dynamic, process-oriented evaluation. MedBench v5 features: (1) a dual-dimensional framework combining Clinical Cognitive Responsiveness (14 sub-dimensions) and Medical Atomic Skills (4 agent environments), covering 63 tasks; (2) three switchable information-flow stressors (omission, contradiction, evidence delay) for factorized degradation analysis; (3) a dynamic process audit protocol with five reasoning nodes that produces model-specific failure fingerprints; (4) hallucination propagation monitoring across initiation, propagation, anchoring, and contradiction interaction, capturing silent hallucination. Experiments on frontier models show that strong overall task performance does not guarantee process stability: stressors mainly disrupt contradiction detection, diagnosis updating, hallucination propagation, and contradiction-based self-correction, while final evidence grounding can remain superficially stable. MedBench v5 provides a unified infrastructure for capability profiling, controllable stress testing, process auditing, and hallucination trajectory analysis in clinical AI evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MedBench v5 structures a process audit and stressors for clinical multimodal models but supplies no external check that those elements match real clinical failure modes.

MedBench v5 is a benchmark that moves medical AI evaluation from static QA to a dynamic setup with process tracking and controllable stressors. The reported experiments show frontier models keeping final evidence grounding while losing stability on contradiction detection, diagnosis updating, hallucination propagation, and self-correction.

What is new is the specific combination: a dual framework of 14 cognitive responsiveness sub-dimensions plus four agent environments for 63 tasks, three switchable stressors (omission, contradiction, evidence delay), a five-node dynamic audit that outputs failure fingerprints, and stage-by-stage hallucination monitoring. This package is more granular than most existing medical benchmarks.

The paper does a reasonable job of describing an infrastructure that separates task outcome from reasoning path and makes stress testing repeatable. That level of visibility is useful when the goal is safety profiling rather than leaderboard scores.

The soft spot is the missing link between the chosen stressors and audit nodes and actual clinical degradation. The abstract and design give no mapping to documented medical error taxonomies, no clinician review of the resulting fingerprints, and no comparison to observed multimodal model failures in practice. Without that, the dissociation between accuracy and process stability could be an artifact of how the test was built. The stress-test concern holds up on the supplied description.

This is for groups that evaluate or develop clinical multimodal systems and want more than end-to-end metrics. A reader working on benchmark design or safety testing would extract concrete ideas.

It deserves peer review because the topic matters and the framework is detailed enough to be worth tightening. Recommendation: send it to referees and flag the need for external anchoring of the process components.

Referee Report

1 major / 0 minor

Summary. The paper introduces MedBench v5, a benchmark for clinical multimodal models that shifts evaluation from static QA to a dynamic, process-oriented framework. It combines a dual-dimensional structure (Clinical Cognitive Responsiveness with 14 sub-dimensions and Medical Atomic Skills across 4 agent environments, totaling 63 tasks), three switchable information-flow stressors (omission, contradiction, evidence delay), a five-node dynamic process audit protocol that generates model-specific failure fingerprints, and monitoring of hallucination propagation (initiation, propagation, anchoring, contradiction interaction). Experiments on frontier models are reported to show that high overall task performance does not ensure process stability, with stressors primarily disrupting contradiction detection, diagnosis updating, hallucination propagation, and contradiction-based self-correction, while final evidence grounding can appear superficially stable.

Significance. If the stressors and five-node audit protocol are demonstrated to capture clinically relevant degradation modes, MedBench v5 could advance the field by enabling capability profiling, controllable stress testing, and hallucination trajectory analysis beyond aggregate accuracy metrics. The unified infrastructure for process auditing and silent hallucination detection represents a constructive step toward more diagnostic evaluations of clinical AI systems.

major comments (1)

[Abstract (experimental results) and description of dynamic process audit protocol] The central experimental claim—that stressors selectively disrupt contradiction detection, diagnosis updating, hallucination propagation, and self-correction while evidence grounding remains stable—rests on the assumption that the three information-flow stressors and five-node process audit protocol factorize real clinical degradation modes. The manuscript supplies no external anchoring such as mapping to documented clinical error taxonomies, clinician validation of the resulting fingerprints, or comparison against observed real-world multimodal model failures in clinical settings. Absent this grounding, the reported dissociation between task accuracy and process stability risks being an artifact of the synthetic stressor design.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive critique of the experimental grounding. We respond to the single major comment below and commit to revisions that clarify the scope of our claims without overstating the benchmark's clinical fidelity.

Referee: The central experimental claim—that stressors selectively disrupt contradiction detection, diagnosis updating, hallucination propagation, and self-correction while evidence grounding remains stable—rests on the assumption that the three information-flow stressors and five-node process audit protocol factorize real clinical degradation modes. The manuscript supplies no external anchoring such as mapping to documented clinical error taxonomies, clinician validation of the resulting fingerprints, or comparison against observed real-world multimodal model failures in clinical settings. Absent this grounding, the reported dissociation between task accuracy and process stability risks being an artifact of the synthetic stressor design.

Authors: We agree that the manuscript lacks explicit external anchoring via clinical error taxonomies, clinician validation, or direct comparison to real-world failures. The stressors were motivated by information-flow vulnerabilities commonly discussed in clinical reasoning literature, but this motivation is internal to the benchmark design and does not constitute empirical validation against observed clinical data. The reported dissociation is therefore an observation within the controlled synthetic environment rather than a claim of direct factorization of real-world modes. In revision we will (1) add explicit references to clinical error categories that informed stressor selection, (2) insert a dedicated limitations subsection stating the absence of clinician validation and real-world mapping, and (3) rephrase the abstract and discussion to emphasize that the benchmark enables controlled stress testing rather than claiming to replicate clinical degradation modes. These changes will make the synthetic scope of the results transparent.

Revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark description with no derivations or fitted quantities

The paper introduces MedBench v5 as a new benchmark with defined components (dual-dimensional framework, stressors, five-node audit protocol, hallucination monitoring) but contains no equations, derivations, parameter fitting, or predictions that reduce to inputs by construction. No self-citation chains support load-bearing claims, and the work is self-contained as a descriptive evaluation infrastructure without internal circular reductions. This matches the default expectation for non-circular benchmark papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model, free parameters, or invented entities; the paper introduces an evaluation framework rather than a derivation.

pith-pipeline@v0.9.1-grok · 5794 in / 1010 out tokens · 18668 ms · 2026-06-26T05:40:27.169694+00:00 · methodology
