pith. sign in

arxiv: 2606.12291 · v1 · pith:GZJCXXMEnew · submitted 2026-06-10 · 💻 cs.CL

Measuring Epistemic Resilience of LLMs Under Misleading Medical Context

Pith reviewed 2026-06-27 09:28 UTC · model grok-4.3

classification 💻 cs.CL
keywords epistemic resiliencelarge language modelsmedical reasoningadversarial robustnessmisleading contextMedMisBenchLLM evaluation
0
0 comments X

The pith

LLMs abandon correct medical answers when misleading context is injected into questions they previously answered correctly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether high performance on medical exams means LLMs can safely handle health advice. It defines epistemic resilience as maintaining correct judgment under misleading or adversarial context. To measure this, the authors built MedMisBench with thousands of medical questions and pairs of misleading contexts. Testing across multiple models shows a large drop in accuracy when misleading information is added, especially certain types of formal fabrications. Clinicians reviewing cases see potential for serious harm, pointing to a missing aspect in current LLM medical evaluations.

Core claim

When misleading context is injected into questions that LLMs originally answer correctly, they abandon the correct answer. Across 11 model configurations, mean accuracy falls from 71.1% on original questions to 38.0% under focused misleading context, with 51.5% attack success. The most damaging injections are formal, rule-like fabrications such as authority-framed falsehoods.

What carries the argument

MedMisBench, a collection of 10,932 medical question items and 48,889 misleading context-option pairs that tests whether models preserve correct answers when given misleading medical information.

If this is right

  • Authority-framed falsehoods achieve 69.5% attack success and exception-poisoning claims reach 64.1%.
  • 38.2% of reviewed cases were identified by a clinical panel as having serious potential harm.
  • Existing benchmarks measure what models know but not whether they preserve correct medical judgment under misleading context.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the results hold, LLMs may require new defenses against misleading inputs before deployment in medical settings.
  • Real-world patient interactions could reveal even greater vulnerabilities if the benchmark contexts are less varied than natural misinformation.
  • The focus on medical domain suggests similar resilience tests could apply to other high-stakes areas like legal or financial advice.

Load-bearing premise

The misleading contexts constructed for MedMisBench are representative of the kinds of misleading information that could realistically affect LLM medical decision-making.

What would settle it

A study where LLMs are given misleading context in simulated real patient consultations and their advice is checked against actual medical outcomes or expert review.

Figures

Figures reproduced from arXiv: 2606.12291 by Alexandra M. Cardoso Pinto, Bradley Max Segal, David A. Clifton, Dhruv Darji, Fenglin Liu, Hongjian Zhou, Jinge Wu, Joshua Fieggen, Junchi Yu, Jung Moses Koo, Kapil Narain, Laura Sophie Wegner, Lei Clifton, Linda Shapiro, Michael Petrus, Mingde Zeng, Rachel Niesen, Sara Amro, Sean Wu, Sheikh Momin, Tobias Erich Niebuhr, Xinyu Zou.

Figure 1
Figure 1. Figure 1: Focused false context can redirect a correct medical judgment. An authority-framed [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: MedMisBench turns clean medical QA into paired resilience tests: filtered answer-grounded [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: MedMisBench spans medical reasoning, patient-journey, and agentic tasks. After filtering, 10,932 answer-grounded items remain across 5 source datasets. Selection and filtering. We refer to the injected versions as MEDMISQA, MED￾MISMCQA, MEDMISXPERTQA, MED￾MISJOURNEY, and MEDMISHLE. Across all 5 sources, we retain only items that are answer-grounded, and admit at least 1 semantically valid misleading-contex… view at source ↗
Figure 4
Figure 4. Figure 4: Clean accuracy overstates epistemic resilience. Mean accuracy falls from 71.1% clean to [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Focused delivery drives most resilience loss. Type 1 averages 51.5% ASR versus 18.7% for [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Formal, rule-like falsehoods are most damag [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Clinician review shows clinical harm is common: [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Delivery setting changes the resilience test: [PITH_FULL_IMAGE:figures/full_fig_p018_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Applicability filtering makes flips interpretable: accepted items preserve the gold answer [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Option-aligned generation enables targeted attribution. Type 1 selects one wrong-option [PITH_FULL_IMAGE:figures/full_fig_p019_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Clinician review draws on ge￾ographically diverse judgment from a 14- member panel spanning 7 countries. Given the limited availability of clinician reviewer time and the need for substantial dual-rater overlap, we use the 89 completed-review tasks as a targeted validation and harm-review sample rather than an exhaustive manual review of the full benchmark. MEDMISJOURNEY is not represented because its ite… view at source ↗
Figure 12
Figure 12. Figure 12: Clinician validation supports benchmark-item quality. Rubric A scores are high for base validity, answer preservation, attack fidelity, and clinical plausibility [PITH_FULL_IMAGE:figures/full_fig_p020_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Clinicians review both item validity and response harm using the case, injected context, [PITH_FULL_IMAGE:figures/full_fig_p021_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Clean accuracy does not predict resilience: the [PITH_FULL_IMAGE:figures/full_fig_p024_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Generator choice does not explain the main pattern: Type 1 remains more damaging than [PITH_FULL_IMAGE:figures/full_fig_p028_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Provenance findings are stable to cyclic reassignment. Neutral and authority-like framings [PITH_FULL_IMAGE:figures/full_fig_p028_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Evidence gathering helps but is model-dependent. On HLE-only tasks, search sharply [PITH_FULL_IMAGE:figures/full_fig_p029_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: The defensive instruction is a lightweight resilience intervention: it warns that added [PITH_FULL_IMAGE:figures/full_fig_p030_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Warnings help but are incomplete. The defensive prompt lowers Type 1 ASR by 10.1–14.0 [PITH_FULL_IMAGE:figures/full_fig_p030_19.png] view at source ↗
read the original abstract

Large language models (LLMs) now reach expert-level scores on medical licensing exams, encouraging the assumption that high scores imply safe medical judgment while patients increasingly use them for health advice. We show this assumption is fragile: when misleading context is injected into questions that LLMs originally answer correctly, they abandon the correct answer. We call the ability to maintain correct judgment under adversarial context epistemic resilience, and introduce MedMisBench to measure it. MedMisBench contains 10,932 medical question items and 48,889 misleading context-option pairs spanning medical reasoning, agentic capability, and patient-journey evaluation. Across 11 model configurations, mean accuracy falls from 71.1% on original questions to 38.0% under focused misleading context, with 51.5% attack success. The most damaging injections are formal, rule-like fabrications: authority-framed falsehoods reach 69.5% attack success and exception-poisoning claims reach 64.1%. A 14-member clinical panel from 7 countries identified serious potential harm in 38.2% of reviewed cases. MedMisBench exposes a structural blind spot in LLM evaluation in medical settings: existing benchmarks measure what models know, but not whether they preserve correct medical judgment under misleading context.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that LLMs exhibit low epistemic resilience in medical settings: when author-generated misleading contexts are injected into questions the models originally answer correctly, mean accuracy across 11 configurations drops from 71.1% to 38.0% with 51.5% attack success on the new MedMisBench benchmark (10,932 questions, 48,889 context-option pairs). It further reports that authority-framed and exception-poisoning fabrications are most effective and that a 14-member clinical panel flagged serious potential harm in 38.2% of reviewed cases.

Significance. If the benchmark construction is shown to be representative, the work usefully exposes a gap between existing medical QA benchmarks (which test knowledge) and real-world deployment risks (where patients or external sources may supply misleading information). The large scale of MedMisBench and the inclusion of a multi-country clinical harm review are concrete strengths that could support follow-on safety evaluations.

major comments (2)
  1. [Benchmark construction] Benchmark construction section: the paper provides no validation that the 48,889 author-generated misleading contexts are representative of naturalistic medical misinformation (social media, patient reports, forums). The clinical panel review is limited to harm assessment (38.2%); absence of ecological-validity checks or comparison to observed real-world examples is load-bearing for the claim that the measured 51.5% attack success generalizes to realistic medical decision-making scenarios.
  2. [Experimental results] Experimental results: aggregate accuracy figures (71.1% → 38.0%) and attack rates are presented without reported statistical tests, confidence intervals, per-model variance, or controls for question difficulty between original and misleading conditions. This makes it impossible to assess whether the central drop is robust or driven by a subset of items.
minor comments (1)
  1. [Abstract] The abstract states results across '11 model configurations' but does not list the models or prompting setups; adding this detail would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback. We address each major comment below, indicating where revisions will be made.

read point-by-point responses
  1. Referee: [Benchmark construction] Benchmark construction section: the paper provides no validation that the 48,889 author-generated misleading contexts are representative of naturalistic medical misinformation (social media, patient reports, forums). The clinical panel review is limited to harm assessment (38.2%); absence of ecological-validity checks or comparison to observed real-world examples is load-bearing for the claim that the measured 51.5% attack success generalizes to realistic medical decision-making scenarios.

    Authors: We agree that the benchmark relies on author-generated contexts without direct empirical matching to real-world medical misinformation distributions. Contexts were constructed to instantiate documented misinformation patterns (authority framing, exception poisoning) drawn from prior literature, but no systematic comparison to scraped social media or forum examples was performed. The clinical panel assessed harm potential only. We will revise the manuscript to add an explicit limitations section clarifying the scope of generalizability claims and noting the absence of ecological validity checks. revision: partial

  2. Referee: [Experimental results] Experimental results: aggregate accuracy figures (71.1% → 38.0%) and attack rates are presented without reported statistical tests, confidence intervals, per-model variance, or controls for question difficulty between original and misleading conditions. This makes it impossible to assess whether the central drop is robust or driven by a subset of items.

    Authors: The accuracy drop is measured on the exact same items that each model answered correctly in the original condition, providing an item-level paired control for question difficulty. We acknowledge the value of additional statistical reporting. In the revision we will add 95% bootstrap confidence intervals for aggregate and per-model metrics, report model-level variance, and include paired statistical tests (e.g., McNemar’s test) for the significance of accuracy changes. revision: yes

standing simulated objections not resolved
  • Empirical validation of the generated misleading contexts against observed real-world medical misinformation distributions from social media, patient reports, or forums.

Circularity Check

0 steps flagged

No significant circularity; results are direct empirical measurements on a new benchmark

full rationale

The paper introduces MedMisBench as a new collection of 10,932 questions and 48,889 author-generated misleading context-option pairs, then directly measures LLM accuracy drops (71.1% to 38.0%) and attack success (51.5%) when the contexts are injected. No equations, fitted parameters, or self-citations appear in the derivation chain; the reported figures are observed outcomes of running the models on the constructed items. The central claim is therefore an empirical measurement rather than a reduction to prior definitions or self-referential inputs. The representativeness of the generated contexts is a separate validity question outside the scope of circularity analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review based on abstract only; no free parameters, axioms, or invented entities are extractable beyond the implicit assumption that the benchmark measures the intended phenomenon.

axioms (1)
  • domain assumption Constructed misleading contexts validly test epistemic resilience without introducing unrelated biases.
    The benchmark and reported attack success rates rest on this premise.

pith-pipeline@v0.9.1-grok · 5839 in / 1059 out tokens · 23011 ms · 2026-06-27T09:28:48.771176+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

51 extracted references · 28 canonical work pages · 5 internal anchors

  1. [1]

    The evaluation illusion of large language models in medicine.npj Digital Medicine, 8(1):600, 2025

    Monica Agrawal, Irene Y Chen, Freya Gulamali, and Shalmali Joshi. The evaluation illusion of large language models in medicine.npj Digital Medicine, 8(1):600, 2025

  2. [2]

    Introducing Claude Sonnet 4.6, February 2026

    Anthropic. Introducing Claude Sonnet 4.6, February 2026. URL https://www.anthropic. com/news/claude-sonnet-4-6. Published February 17, 2026

  3. [3]

    Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum.JAMA internal medicine, 183(6):589–596, 2023

    John W Ayers, Adam Poliak, Mark Dredze, Eric C Leas, Zechariah Zhu, Jessica B Kelley, Dennis J Faix, Aaron M Goodman, Christopher A Longhurst, Michael Hogarth, et al. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum.JAMA internal medicine, 183(6):589–596, 2023

  4. [4]

    Testing and evaluation of health care applications of large language models: A systematic review.JAMA, 2025

    Suhana Bedi, Yutong Liu, Lucy Orr-Ewing, Dev Dash, Sanmi Koyejo, Alison Callahan, Jason A. Fries, Michael Wornow, Akshay Swaminathan, Lisa Soleymani Lehmann, Hyo Jung Hong, Mehr Kashyap, Akash R. Chaurasia, Nirav R. Shah, Karandeep Singh, Troy Tazbaz, Arnold Milstein, Michael A. Pfeffer, and Nigam H. Shah. Testing and evaluation of health care application...

  5. [5]

    Felix Busch, Lena Hoffmann, Christopher Rueger, Elon H. C. van Dijk, Rawen Kader, Esteban Ortiz-Prado, Marcus R. Makowski, Luca Saba, Martin Hadamitzky, Jakob Nikolas Kather, Daniel Truhn, Renato Cuocolo, Lisa C. Adams, and Keno K. Bressem. Current applications and challenges in large language models for patient care: A systematic review.Communications Me...

  6. [6]

    Medbench: A large-scale chinese benchmark for evaluating medical large language models

    Yan Cai, Linlin Wang, Ye Wang, Gerard de Melo, Ya Zhang, Yanfeng Wang, and Liang He. Medbench: A large-scale chinese benchmark for evaluating medical large language models. arXiv preprint arXiv:2312.12806, 2023. doi: 10.48550/arXiv.2312.12806

  7. [7]

    Singer, Xuguang Ai, Po-Ting Lai, Zhizheng Wang, et al

    Qingyu Chen, Yan Hu, Xueqing Peng, Qianqian Xie, Qiao Jin, Aidan Gilson, Maxwell B. Singer, Xuguang Ai, Po-Ting Lai, Zhizheng Wang, et al. Benchmarking large language models for biomedical natural language processing applications and recommendations.Nature Communications, 16(1):3280, 2025. doi: 10.1038/s41467-025-56989-2

  8. [8]

    Medriskeval: Medical risk evaluation benchmark of language models, on the importance of user perspectives in healthcare settings

    Jean-Philippe Corbeil, Minseon Kim, Maxime Griot, Sheela Agarwal, Alessandro Sordoni, François Beaulieu, and Paul V ozila. Medriskeval: Medical risk evaluation benchmark of language models, on the importance of user perspectives in healthcare settings. InProceedings of the 19th Conference of the European Chapter of the Association for Computational Lingui...

  9. [9]

    Nour, Seth Spielman, Samuel F

    Beatriz Costa-Gomes, Pavel Tolmachev, Eloise Taysom, Viknesh Sounderajah, Hannah Richard- son, Philipp Schoenegger, Xiaoxuan Liu, Matthew M. Nour, Seth Spielman, Samuel F. Way, Yash Shah, Michael Bhaskar, Harsha Nori, Christopher Kelly, Peter Hames, Bay Gross, Mustafa Suleyman, and Dominic King. Public use of a generalist LLM chatbot for health queries.Na...

  10. [10]

    Openseeker: Democratizing frontier search agents by fully open-sourcing training data.arXiv preprint arXiv:2603.15594, 2026

    Yuwen Du, Rui Ye, Shuo Tang, Xinyu Zhu, Yijun Lu, Yuzhu Cai, and Siheng Chen. Openseeker: Democratizing frontier search agents by fully open-sourcing training data.arXiv preprint arXiv:2603.15594, 2026. doi: 10.48550/arXiv.2603.15594

  11. [11]

    Gemini 3.1 Flash-Lite model card, March 2026

    Google DeepMind. Gemini 3.1 Flash-Lite model card, March 2026. URL https://deepmind .google/models/model-cards/gemini-3-1-flash-lite/. Published March 3, 2026

  12. [12]

    Gemini 3.1 Pro model card, February 2026

    Google DeepMind. Gemini 3.1 Pro model card, February 2026. URL https://deepmind.g oogle/models/model-cards/gemini-3-1-pro/. Published February 19, 2026

  13. [13]

    Gemma 4 model card, April 2026

    Google DeepMind. Gemma 4 model card, April 2026. URL https://ai.google.dev/gemm a/docs/core/model_card_4. Updated April 2, 2026

  14. [14]

    URL https://arxiv.org/abs/2302.12173

    Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. InProceedings of the 16th ACM Workshop on Artificial Intelligence and Security, pages 79–90, 2023. doi: 10.1145/3605764.3623985

  15. [15]

    Bressem, Jakob Niko- las Kather, and Daniel Truhn

    Tianyu Han, Sven Nebelung, Firas Khader, Tianci Wang, Gustav Müller-Franzes, Christiane Kuhl, Sebastian Försch, Jens Kleesiek, Christoph Haarburger, Keno K. Bressem, Jakob Niko- las Kather, and Daniel Truhn. Medical large language models are susceptible to targeted misinformation attacks.npj Digital Medicine, 7:288, 2024. doi: 10.1038/s41746-024-01282-7

  16. [16]

    Medagentbench: Dataset for benchmarking llms as agents in medical applications,

    Yixing Jiang, Kameron C. Black, Gloria Geng, Danny Park, James Zou, Andrew Y . Ng, and Jonathan H. Chen. Medagentbench: A realistic virtual EHR environment to benchmark medical LLM agents.arXiv preprint arXiv:2501.14654, 2025. doi: 10.48550/arXiv.2501.14654

  17. [17]

    What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021

    Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021. doi: 10.3390/app11146421

  18. [18]

    Kovacheva, and Daniel Shu Wei Ting

    Yu He Ke, Liyuan Jin, Kabilan Elangovan, Hairil Rizal Abdullah, Nan Liu, Alex Tiong Heng Sia, Chai Rick Soh, Joshua Yi Min Tung, Jasmine Chiat Ling Ong, Chang-Fu Kuo, Shao-Chun Wu, Vesela P. Kovacheva, and Daniel Shu Wei Ting. Retrieval augmented generation for 10 large language models and its generalizability in assessing medical fitness.npj Digital Medi...

  19. [19]

    Evaluating clinical competencies of large language models with a general practice benchmark.Nature Communications, 2026

    Zheqing Li, Yiying Yang, Jiping Lang, Wenhao Jiang, Junrong Chen, Yuhang Zhao, Shuang Li, Dingqian Wang, Zhu Lin, et al. Evaluating clinical competencies of large language models with a general practice benchmark.Nature Communications, 2026. doi: 10.1038/s41467-026-71622-6

  20. [20]

    Fenglin Liu, Zheng Li, Hongjian Zhou, Qingyu Yin, Jingfeng Yang, Xianfeng Tang, Chen Luo, Ming Zeng, Haoming Jiang, Yifan Gao, Priyanka Nigam, Sreyashi Nag, Bing Yin, Yining Hua, Xuan Zhou, Omid Rohanian, Anshul Thakur, Lei Clifton, and David A. Clifton. Large language models are poor clinical decision-makers: A comprehensive benchmark. In Proceedings of ...

  21. [21]

    Chen, Yining Hua, Peilin Zhou, Junling Liu, Chengfeng Mao, Chenyu You, Xian Wu, Yefeng Zheng, Lei Clifton, Zheng Li, Jiebo Luo, and David A

    Fenglin Liu, Hongjian Zhou, Boyang Gu, Xinyu Zou, Jinfa Huang, Jinge Wu, Yiru Li, Sam S. Chen, Yining Hua, Peilin Zhou, Junling Liu, Chengfeng Mao, Chenyu You, Xian Wu, Yefeng Zheng, Lei Clifton, Zheng Li, Jiebo Luo, and David A. Clifton. Application of large language models in medicine.Nature Reviews Bioengineering, 3:445–464, 2025. doi: 10.1038/s44222-0...

  22. [22]

    Benchmarking large language models on CMExam - A comprehensive chinese medical exam dataset

    Junling Liu, Peilin Zhou, Yining Hua, Dading Chong, Zhongyu Tian, Andrew Liu, Helin Wang, Chenyu You, Zhenhua Guo, Lei Zhu, and Michael Lingzhi Li. Benchmarking large language models on CMExam - A comprehensive chinese medical exam dataset. InAdvances in Neural Information Processing Systems 36: Datasets and Benchmarks Track, 2023. URL https://proceedings...

  23. [23]

    Carrero, Xiaofeng Jiang, Dyke Ferber, Georg Wölflein, Li Zhang, Sanddhya Jayabalan, Tim Lenz, Zhouguang Hui, et al

    Yunsong Liu, Zunamys I. Carrero, Xiaofeng Jiang, Dyke Ferber, Georg Wölflein, Li Zhang, Sanddhya Jayabalan, Tim Lenz, Zhouguang Hui, et al. Benchmarking large language model- based agent systems for clinical decision tasks.npj Digital Medicine, 9:259, 2026. doi: 10.1038/s41746-026-02443-6

  24. [24]

    Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models

    Zizhan Ma, Wenxuan Wang, Guo Yu, Yiu-Fai Cheung, Meidan Ding, Jie Liu, Wenting Chen, and Linlin Shen. Beyond the leaderboard: Rethinking medical benchmarks for large language models.arXiv preprint arXiv:2508.04325, 2025. doi: 10.48550/arXiv.2508.04325

  25. [25]

    Persuasion with Large Language Models: A Survey of Empirical Evidence, Study Methodologies, and Ethical Implications

    Sander Noels, Alexander Rogiers, Maarten Buyl, and Tijl De Bie. Persuasion with large language models: A survey of empirical evidence, study methodologies, and ethical implications. CoRR, abs/2411.06837, 2024. doi: 10.48550/arXiv.2411.06837

  26. [26]

    Capabilities of GPT-4 on Medical Challenge Problems

    Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. Capa- bilities of GPT-4 on medical challenge problems.arXiv preprint arXiv:2303.13375, 2023. doi: 10.48550/arXiv.2303.13375

  27. [27]

    Wieler, Alexander W

    Mahmud Omar, Vera Sorin, Lothar H. Wieler, Alexander W. Charney, Patricia Kovatch, Carol R. Horowitz, Panagiotis Korfiatis, Benjamin S. Glicksberg, Robert Freeman, Girish N. Nadkarni, and Eyal Klang. Mapping the susceptibility of large language models to medical misinformation across clinical notes and social media: A cross-sectional benchmarking analysis...

  28. [28]

    Introducing HealthBench, May 2025

    OpenAI. Introducing HealthBench, May 2025. URL https://openai.com/index/healt hbench/. Published May 12, 2025

  29. [29]

    Introducing ChatGPT health, January 2026

    OpenAI. Introducing ChatGPT health, January 2026. URLhttps://openai.com/index/i ntroducing-chatgpt-health/. Published January 7, 2026

  30. [30]

    Introducing GPT-5.4, March 2026

    OpenAI. Introducing GPT-5.4, March 2026. URL https://openai.com/index/introdu cing-gpt-5-4/. Published March 5, 2026

  31. [31]

    Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering

    Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. InProceedings of the Conference on Health, Inference, and Learning, volume 174 ofProceedings of Machine Learning Research, pages 248–260, 2022. 12

  32. [32]

    and Askell, Amanda and Grosse, Roger and Hernandez, Danny and Ganguli, Deep and Hubinger, Evan and Schiefer, Nicholas and Kaplan, Jared

    Ethan Perez, Sam Ringer, Kamile Lukosiute, et al. Discovering language model behaviors with model-written evaluations. InFindings of the Association for Computational Linguistics: ACL 2023, pages 13387–13434, 2023. doi: 10.18653/v1/2023.findings-acl.847

  33. [33]

    A benchmark of expert-level academic questions to assess ai capabilities.Nature, 649(8099):1139–1146, 2026

    Long Phan, Alice Gatti, Nathaniel Li, Adam Khoja, Ryan Kim, Richard Ren, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Dan Hendrycks, et al. A benchmark of expert-level academic questions to assess ai capabilities.Nature, 649(8099):1139–1146, 2026

  34. [34]

    Qwen3.6-35B-A3B: Agentic coding power, now open to all, April 2026

    Qwen Team. Qwen3.6-35B-A3B: Agentic coding power, now open to all, April 2026. URL https://qwen.ai/blog?id=qwen3.6-35b-a3b. Published April 16, 2026

  35. [35]

    Rao, Kaiz P

    Arya S. Rao, Kaiz P. Esmail, Richard S. Lee, et al. Large language model performance and clinical reasoning tasks.JAMA Network Open, 9(4):e264003, 2026. doi: 10.1001/jamanetwor kopen.2026.4003

  36. [36]

    AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

    Samuel Schmidgall, Rojin Ziaei, Carl Harris, Eduardo Reis, Jeffrey Jopling, and Michael Moor. Agentclinic: A multimodal agent benchmark to evaluate AI in simulated clinical environments. arXiv preprint arXiv:2405.07960, 2024. doi: 10.48550/arXiv.2405.07960

  37. [37]

    Medgemma technical report.arXiv preprint arXiv:2507.05201,

    Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, et al. Medgemma technical report.arXiv preprint arXiv:2507.05201,

  38. [38]

    doi: 10.48550/arXiv.2507.05201

  39. [39]

    Bowman, Esin Durmus, Zac Hatfield-Dodds, Scott R

    Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna M. Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, and Ethan Perez. Towards understanding sycophancy in language models. InThe Twelfth Internat...

  40. [40]

    Nature , volume=

    Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, et al. Large language models encode clinical knowledge.Nature, 620(7972):172–180, 2023. doi: 10.1038/s41586-023-06291-2

  41. [41]

    Toward expert-level medical question answering with large language models.Nature medicine, 31(3):943–950, 2025

    Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen R Pfohl, Heather Cole-Lewis, et al. Toward expert-level medical question answering with large language models.Nature medicine, 31(3):943–950, 2025

  42. [42]

    Large language models in medicine

    Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. Large language models in medicine.Nature Medicine, 29(8):1930–1940, 2023. doi: 10.1038/s41591-023-02448-8

  43. [43]

    Scientists invented a fake disease

    Almira Osmanovic Thunström. Scientists invented a fake disease. AI told people it was real. Nature, 652:559, 2026

  44. [44]

    Nature , author =

    Tao Tu, Mike Schaekermann, Anil Palepu, Khaled Saab, Jan Freyberg, Ryutaro Tanno, Amy Wang, Brenna Li, Mohamed Amin, Yong Cheng, Elahe Vedadi, Nenad Tomasev, Shekoofeh Azizi, Albert Webson, S. Sara Mahdavi, Joelle Barral, Karan Singhal, Le Hou, Kavita Kulkarni, Christopher Semturs, Juraj Gottweis, Katherine Chou, Greg S. Corrado, Yossi Matias, Alan Karthi...

  45. [45]

    A novel evaluation benchmark for medical llms illuminating safety and effectiveness in clinical domains.npj Digital Medicine, 2025

    Shirui Wang, Zhihui Tang, Huaxia Yang, Qiuhong Gong, Tiantian Gu, Hongyang Ma, Yongxin Wang, Wubin Sun, Zeliang Lian, Kehang Mao, et al. A novel evaluation benchmark for medical llms illuminating safety and effectiveness in clinical domains.npj Digital Medicine, 2025

  46. [46]

    Understanding the infodemic and misinformation in the fight against COVID-19, 2026

    World Health Organization. Understanding the infodemic and misinformation in the fight against COVID-19, 2026. URL https://www.who.int/health-topics/infodemic/understa nding-the-infodemic-and-misinformation-in-the-fight-against-covid-19 . Accessed April 18, 2026. 13

  47. [47]

    Medjourney: Benchmark and evaluation of large language models over patient clinical journey.Advances in Neural Information Processing Systems, 37:87621–87646, 2024

    Xian Wu, Yutian Zhao, Yunyan Zhang, Jiageng Wu, Zhihong Zhu, Yingying Zhang, Yi Ouyang, Ziheng Zhang, Huimin Wang, Zhenxi Lin, et al. Medjourney: Benchmark and evaluation of large language models over patient clinical journey.Advances in Neural Information Processing Systems, 37:87621–87646, 2024

  48. [48]

    Bitterman, Jasmine Chiat Ling Ong, Daniel Shu Wei Ting, Serena Hong, and Nan Liu

    Rui Yang, Yilin Ning, Emilia Keppo, Mingxuan Liu, Chuan Hong, Danielle S. Bitterman, Jasmine Chiat Ling Ong, Daniel Shu Wei Ting, Serena Hong, and Nan Liu. Retrieval-augmented generation for generative artificial intelligence in health care.npj Health Systems, 2:2, 2025. doi: 10.1038/s44401-024-00004-1

  49. [49]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?i d=WE_vluYUL-X

  50. [50]

    PoisonedRAG: Knowledge corruption attacks to Retrieval-Augmented generation of large language models

    Wei Zou, Runpeng Geng, Binghui Wang, and Jinyuan Jia. PoisonedRAG: Knowledge corruption attacks to Retrieval-Augmented generation of large language models. In34th USENIX Security Symposium (USENIX Security 25), pages 3827–3844, 2025

  51. [51]

    is_viable

    Yuxin Zuo, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, and Bowen Zhou. Medxpertqa: Benchmarking expert-level medical reasoning and understanding.arXiv preprint arXiv:2501.18362, 2025. 14 Appendices A Benchmark Scope and Construction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ....