DDX-TRACE: A Benchmark for Medical Diagnostic Trajectories in VLMs

Benedikt Wiestler; Daniel Rueckert; Felix Bitzer; Jiancheng Yang; Jiazhen Pan; Julian Canisius; Jun Li; Paula Roßmüller; Virginie Kreutzinger; Weixiang Shen

arxiv: 2605.23629 · v1 · pith:CJLLDCTGnew · submitted 2026-05-22 · 💻 cs.CV

Pith reviewed 2026-05-25 04:51 UTC · model grok-4.3

keywords: medical diagnosis evaluation, diagnostic trajectories, vision-language models, benchmark, neuroradiology, sequential reasoning, evidence gathering, workup quality

The pith

Medical AI benchmarks that score only final diagnoses can mask unsupported guesses and inefficient workups by models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current medical AI tests usually hand models complete case details and judge only the final diagnosis. This setup hides problems such as models reaching the right answer without key evidence, requesting studies but misreading the images, or gathering data inefficiently while updating their beliefs poorly. The paper introduces DDX-TRACE, a benchmark of 211 neuroradiology cases that starts models with limited history and requires them to request imaging studies in sequence, interpret the returned images, revise a probabilistic differential diagnosis after each step, and stop when confident. Physician adjudication of hidden evidence lets the benchmark separate good final answers from good diagnostic processes. If the claim holds, evaluation must track the full trajectory rather than the endpoint alone to reflect real clinical requirements.

Core claim

The paper claims that final diagnosis scores can substantially misrepresent workup quality because models may guess plausible diagnoses without essential evidence, request useful studies but misinterpret raw images, or acquire evidence inefficiently while updating uncertainty poorly. DDX-TRACE evaluates state-of-the-art vision-language models on full diagnostic trajectories under a hidden-evidence protocol over 211 challenging cases. Each case begins with limited clinical history; models request studies in free form, receive matched image bundles when available, update a probabilistic differential after each turn, and conclude with a localized diagnosis. Controlled evidence variants isolate bottlenecks in planning, visual evidence extraction, and downstream differential reasoning.
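The protocol in this claim can be pictured as a simple interaction loop. The Python sketch below is illustrative only: the `Case` and `Step` structures, the `run_episode` function, the exact-string lookup of requests, and the toy agent are all assumptions made here, not the benchmark's interface. In particular, the summary does not say how free-form requests are matched to image bundles, so a plain dictionary lookup stands in for that step.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Differential = Dict[str, float]              # diagnosis -> probability
Evidence = List[Tuple[str, Optional[str]]]   # (request, revealed bundle or None)

@dataclass
class Case:
    history: str                     # limited clinical history shown up front
    hidden_bundles: Dict[str, str]   # study name -> image bundle (placeholder strings)

@dataclass
class Step:
    differential: Differential       # belief given the evidence seen so far
    request: Optional[str]           # next study requested, or None to stop
    revealed: Optional[str]          # bundle returned for that request, if any

def normalize(d: Differential) -> Differential:
    """Rescale weights so the differential sums to 1."""
    total = sum(d.values())
    if total <= 0:
        raise ValueError("differential must have positive total mass")
    return {k: v / total for k, v in d.items()}

# Hypothetical agent signature: (history, evidence so far) -> (request or None, differential)
Agent = Callable[[str, Evidence], Tuple[Optional[str], Differential]]

def run_episode(case: Case, agent: Agent, max_turns: int = 10) -> List[Step]:
    evidence: Evidence = []
    trajectory: List[Step] = []
    for _ in range(max_turns):
        request, diff = agent(case.history, evidence)
        diff = normalize(diff)
        if request is None:                           # agent decides to stop
            trajectory.append(Step(diff, None, None))
            break
        revealed = case.hidden_bundles.get(request)   # None when no bundle matches
        evidence.append((request, revealed))
        trajectory.append(Step(diff, request, revealed))
    return trajectory

# Toy agent and case with placeholder content, purely for demonstration.
def toy_agent(history: str, evidence: Evidence):
    if not evidence:
        return "MRI brain", {"diagnosis_a": 1.0, "diagnosis_b": 1.0}
    return None, {"diagnosis_a": 3.0, "diagnosis_b": 1.0}

toy_case = Case(history="placeholder history", hidden_bundles={"MRI brain": "<bundle>"})
trajectory = run_episode(toy_case, toy_agent)
```

Recording the differential at every turn, rather than only at the end, is what makes the trajectory auditable afterwards.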

What carries the argument

DDX-TRACE benchmark that uses a physician-adjudicated hidden-evidence protocol to evaluate sequential diagnostic trajectories instead of final answers alone.

If this is right

High final diagnosis scores do not guarantee that models requested or used essential evidence.
Visual interpretation of raw images can fail even when appropriate studies are requested.
Evidence acquisition can remain inefficient while uncertainty updating stays poor.
Controlled variants of available evidence can separate failures in planning from failures in visual extraction and differential reasoning.
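The first consequence can be read as a per-case audit that compares final diagnostic credit with coverage of the physician-marked essential studies. A minimal sketch follows, assuming hypothetical function names, a set-based notion of coverage, and an arbitrary pass threshold of 0.5; the paper's actual scoring rules are not given in this summary.

```python
from typing import Iterable, Set

def essential_coverage(requested: Iterable[str], essential: Set[str]) -> float:
    """Fraction of essential studies that were requested (1.0 if none are marked essential)."""
    if not essential:
        return 1.0
    return len(essential & set(requested)) / len(essential)

def endpoint_pass_workup_fail(
    final_score: float,
    requested: Iterable[str],
    essential: Set[str],
    pass_threshold: float = 0.5,   # assumed cutoff, not taken from the paper
) -> bool:
    """True when the final answer earns credit but essential evidence is missing."""
    passed_endpoint = final_score >= pass_threshold
    complete_workup = essential_coverage(requested, essential) >= 1.0
    return passed_endpoint and not complete_workup

# Placeholder study names for demonstration.
essential_studies = {"CT head", "MRI brain"}
flag_partial = endpoint_pass_workup_fail(1.0, ["CT head"], essential_studies)
flag_complete = endpoint_pass_workup_fail(1.0, ["CT head", "MRI brain"], essential_studies)
```

A case flagged by this check would count as correct on a final-answer leaderboard while still failing the workup.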

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training methods for medical vision-language models could shift from rewarding correct endpoints to rewarding evidence-supported sequences.
Trajectory evaluation might be required before safe clinical deployment to avoid models that reach answers through unsupported paths.
Similar hidden-evidence benchmarks could be built for other specialties to test whether the same mismatch between final score and process quality appears.

Load-bearing premise

The 211 cases and the physician-adjudicated hidden-evidence protocol accurately capture real clinical diagnostic trajectories.

What would settle it

If models with high final diagnosis scores on DDX-TRACE consistently request the physician-marked essential studies and interpret the returned images in line with the physician-adjudicated evidence paths, that observation would undermine the claim that final scores substantially misrepresent workup quality.

Figures

Figures reproduced from arXiv: 2605.23629 by Benedikt Wiestler, Daniel Rueckert, Felix Bitzer, Jiancheng Yang, Jiazhen Pan, Julian Canisius, Jun Li, Paula Roßmüller, Virginie Kreutzinger, Weixiang Shen.

**Figure 1.** DDX-TRACE overview. A) Conventional medical benchmarks often reveal all the relevant evidence upfront and score only the final answer, making it difficult to detect unsupported correct guesses, premature closure, over-testing, or poor belief updating. B) DDX-TRACE instead starts from a limited history and requires the model to request imaging evidence sequentially, update a probabilistic differential diagn…

**Figure 2.** Endpoint versus process-aware ranking. Endpoint rank uses Sdx. Process rank is used only for visualization and is computed as the mean of SER, Sorder, and Straj. This separation motivates route-aware evaluation. A correct final diagnosis may still be reached after missing essential evidence, requesting studies in a poor order, or stopping before the workup is sufficient. Conversely, useful evidence acqui…

**Figure 3.** Endpoint-pass/workup-fail audit. A case-level trace contrasts final diagnostic credit with evidence acquisition, ordering, and clinical-sufficiency checks, showing how a correct answer can still arise from an insufficient diagnostic route. Finding 3: Correct diagnostic guesses are rarely supported by complete essential evidence.

**Figure 4.** DDX-TRACE overview. Each case begins with limited clinical history only, and the agent does not receive a list of available examinations. The agent interacts with the environment by requesting imaging exams in free form, observing newly revealed image bundles, updating a differential diagnosis list with probabilities after every turn, and finally producing a localized diagnosis. The benchmark evaluates end…

**Figure 5.** Distribution of the official release/evaluation cases across the main dataset attributes, includ…

**Figure 6.** Distribution of imaging evidence and related benchmark characteristics, including case…

**Figure 7.** Benchmark gap under passive evaluation with Sdx as the metric. When models are given all images at once or oracle text findings instead of having to actively request and interpret evidence, their apparent performance improves. These passive settings remove the need for clinically grounded evidence acquisition, planning, and intermediate belief updating, and therefore obscure the gap between strong surface-…

**Figure 8.** Benchmark gap comparing passive endpoint score and active trajectory score. This panel compares passive endpoint diagnosis score (Sdx) against active-workup trajectory score (Straj). It should therefore be interpreted as a cross-metric diagnostic-process comparison, not as a within-metric passive-versus-active ablation. Passive settings remove the need for clinically grounded evidence acquisition, planning…

**Figure 9.** Figure 9 shows that sequential evidence can improve endpoint performance, especially for frontier models, but endpoint improvement and clinical sufficiency are not equivalent. Some models stop early or request little without completing the workup; others continue requesting evidence without translating it into better diagnoses. The desired behavior is therefore not simply fewer or more requests, but a correct local…

**Figure 10.** Efficiency–accuracy Pareto frontier. Endpoint diagnostic performance (Sdx) versus the number of requested exams. MedGemma 1.5 4B, GPT-5.4 Mini, and Gemini 3.1 Pro lie on the plotted Pareto frontier: Gemini 3.1 Pro achieves the strongest diagnostic performance among these frontier points but requires relatively more exams, whereas GPT-5.4 Mini occupies a lower-request point on the same frontier.

**Figure 11.** Slice analysis by physician-rated rarity and difficulty. Bars report the mean performance across five representative models: Gemini 3.1 Pro, Gemini 3 Flash, GPT-5.4, GPT-5.4 Mini, and MedGemma 27B. Model performance remains relatively stable across cases annotated as more common versus rarer and easier versus harder for human experts. The extreme-hard and common-rarity slices are small and should be inter…

**Figure 12.** Calibration analysis overview. Reliability diagrams for frontier models, open-weight models, and medical/radiology-adapted models using the final top-1 probability against the normalized diagnosis score (Sdx).

**Figure 13.** Physician annotation interface. Screenshot of the web-based review platform used to annotate case-level and exam-level benchmark metadata. Annotators label the importance of intermediate imaging steps, specify preferred exam order, assess case rarity and difficulty, and correct metadata or template artifacts, including modality, acquisition, view, imaged region, temporal context, contrast usage, and rubri…
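The Figure 2 caption defines two rankings: endpoint rank from Sdx, and a visualization-only process rank from the mean of SER, Sorder, and Straj. The sketch below shows one way to compute both. It assumes that higher scores are better and breaks ties by model name; the model names and numbers are invented placeholders, not results from the paper.

```python
from statistics import mean
from typing import Dict, Tuple

Scores = Dict[str, float]

def rank_by(scores: Scores) -> Dict[str, int]:
    """Rank models so that 1 is best; assumes higher score is better, ties broken by name."""
    ordered = sorted(scores, key=lambda m: (-scores[m], m))
    return {model: position + 1 for position, model in enumerate(ordered)}

def endpoint_and_process_ranks(table: Dict[str, Scores]) -> Tuple[Dict[str, int], Dict[str, int]]:
    # Endpoint rank uses Sdx alone.
    endpoint = {m: s["Sdx"] for m, s in table.items()}
    # Process rank uses the mean of SER, Sorder, and Straj, as the caption states.
    process = {m: mean([s["SER"], s["Sorder"], s["Straj"]]) for m, s in table.items()}
    return rank_by(endpoint), rank_by(process)

# Invented placeholder scores, for demonstration only.
toy_table = {
    "model_a": {"Sdx": 0.70, "SER": 0.30, "Sorder": 0.40, "Straj": 0.35},
    "model_b": {"Sdx": 0.60, "SER": 0.55, "Sorder": 0.50, "Straj": 0.60},
}
endpoint_rank, process_rank = endpoint_and_process_ranks(toy_table)
```

In this toy table the two rankings disagree, which is the kind of separation the caption says motivates route-aware evaluation.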

Original abstract

Medical diagnosis is not a single prediction from a fully specified vignette. It is a sequential workup: clinicians decide what evidence to obtain, revise a differential diagnosis, and stop when the diagnosis is sufficiently supported. Most medical AI benchmarks instead reveal the relevant context upfront and score only the final answer, making unsupported correct guesses, premature closure, inefficient workups, and poor uncertainty updating invisible. We introduce DDX-TRACE, a physician-adjudicated benchmark for multimodal neuroradiology that evaluates diagnostic trajectories under hidden evidence over 211 challenging cases. Each case begins with limited clinical history; models request imaging studies in free form, receive matched image bundles when available, update a probabilistic differential diagnosis after each turn, and stop with a localized final diagnosis. Evaluating state-of-the-art VLMs, we find that final diagnosis scores can substantially misrepresent workup quality: models may guess plausible diagnoses without essential evidence, request useful studies but misinterpret raw images, or acquire evidence inefficiently while updating uncertainty poorly. Controlled evidence variants isolate bottlenecks in planning, visual evidence extraction, and downstream differential reasoning. DDX-TRACE shifts medical AI evaluation from final answers to evidence-supported diagnostic trajectories.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

DDX-TRACE introduces a benchmark for sequential diagnostic workups in VLMs with hidden evidence and free-form requests, which is a clear step past final-answer scoring, but the isolation of bottlenecks depends on unshown validation of the adjudication.

The paper's main contribution is DDX-TRACE, a benchmark of 211 neuroradiology cases that starts models with limited history, lets them request studies in free form, supplies matched image bundles, requires probabilistic differential updates after each turn, and has them stop with a localized diagnosis. Controlled variants are meant to separate planning, visual extraction, and reasoning failures. This setup directly addresses the gap the abstract describes: final diagnosis scores can reward guesses that skip essential evidence or hide inefficient or misinterpreted workups. That framing is useful and matches how real diagnostic work actually proceeds. The design itself is a solid piece of engineering for the stated goal.

The soft spot is the physician-adjudicated labeling of essential evidence and stopping criteria. The abstract gives no inter-rater statistics, no sensitivity checks on the thresholds, and no detail on how free-form requests map to bundles. If those labels carry case-by-case judgment noise, the variant results will not cleanly isolate the three bottlenecks the paper claims to target. Without those numbers in the full text, the central claim that final scores substantially misrepresent workup quality rests on an assumption that still needs evidence.

This is for groups working on medical VLM evaluation and training who want benchmarks closer to clinical sequences. It deserves peer review because the problem it names is real and the benchmark idea is workable, even if the adjudication protocol will need tighter documentation and checks before the results can be taken at face value.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces DDX-TRACE, a physician-adjudicated benchmark of 211 neuroradiology cases that evaluates VLMs on sequential diagnostic trajectories under hidden evidence. Models begin with limited clinical history, issue free-form study requests, receive matched image bundles, update probabilistic differentials after each turn, and terminate with a localized final diagnosis. The central claims are that final-diagnosis scores substantially misrepresent workup quality (e.g., correct guesses without essential evidence, misinterpretation of raw images, inefficient evidence acquisition with poor uncertainty updating) and that controlled evidence variants can isolate bottlenecks in planning, visual extraction, and differential reasoning.

Significance. If the adjudication protocol and variant design are shown to be robust, the benchmark would represent a meaningful advance over static final-answer medical AI evaluations by surfacing clinically relevant failure modes that current leaderboards conceal. The shift toward trajectory-based assessment is a substantive contribution to the field.

major comments (2)

[Abstract / Benchmark Design] Abstract and Benchmark Design section: The claim that controlled evidence variants isolate specific bottlenecks in planning, visual evidence extraction, and downstream differential reasoning rests on the hidden-evidence protocol and physician adjudication of 'essential evidence' and stopping criteria being accurate and reproducible. The manuscript provides no inter-rater agreement statistics, no sensitivity analysis on adjudication thresholds, and no explicit mapping procedure from free-form study requests to matched bundles; without these, apparent model failures in the three bottleneck categories could be artifacts of adjudication noise rather than genuine model deficiencies.
[Evaluation and Results] Evaluation and Results section: The abstract states that 'evaluating state-of-the-art VLMs, we find that final diagnosis scores can substantially misrepresent workup quality' yet supplies no quantitative results, error analysis, or validation details (e.g., per-bottleneck performance deltas, inter-case variance, or comparison against physician trajectories). This absence directly undermines the load-bearing empirical support for the misrepresentation claim.

minor comments (1)

[Abstract] The abstract would benefit from at least one summary statistic (e.g., aggregate accuracy gap between final diagnosis and trajectory quality) to ground the high-level findings.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed and constructive review of our manuscript. We address each major comment point by point below.

Referee: [Abstract / Benchmark Design] Abstract and Benchmark Design section: The claim that controlled evidence variants isolate specific bottlenecks in planning, visual evidence extraction, and downstream differential reasoning rests on the hidden-evidence protocol and physician adjudication of 'essential evidence' and stopping criteria being accurate and reproducible. The manuscript provides no inter-rater agreement statistics, no sensitivity analysis on adjudication thresholds, and no explicit mapping procedure from free-form study requests to matched bundles; without these, apparent model failures in the three bottleneck categories could be artifacts of adjudication noise rather than genuine model deficiencies.

Authors: We agree that quantitative validation of the adjudication protocol is necessary to support the bottleneck isolation claims. The manuscript describes the multi-physician adjudication process for essential evidence and stopping criteria but does not include inter-rater agreement statistics or sensitivity analyses. We will add these in the revision, including Fleiss' kappa for key decisions and a sensitivity analysis on adjudication thresholds. We will also expand the description of the mapping procedure from free-form study requests to matched image bundles. revision: yes
Referee: [Evaluation and Results] Evaluation and Results section: The abstract states that 'evaluating state-of-the-art VLMs, we find that final diagnosis scores can substantially misrepresent workup quality' yet supplies no quantitative results, error analysis, or validation details (e.g., per-bottleneck performance deltas, inter-case variance, or comparison against physician trajectories). This absence directly undermines the load-bearing empirical support for the misrepresentation claim.

Authors: The Evaluation and Results section presents quantitative comparisons between final diagnosis accuracy and trajectory quality metrics. We acknowledge that additional error analysis and validation details would strengthen the empirical support. We will expand the section to include per-bottleneck performance deltas, inter-case variance, and comparisons to physician trajectories on a subset of cases where feasible. revision: yes
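The first response promises Fleiss' kappa for key adjudication decisions. For readers unfamiliar with the statistic, here is a minimal pure-Python sketch of the standard formula. The input layout (one row per rated item, one column per category, each cell holding how many raters chose that category) and the example labels are assumptions for illustration; the manuscript's actual annotation categories are not described in this summary.

```python
from typing import Sequence

def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least 2).
    """
    if not counts:
        raise ValueError("need at least one rated item")
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    n_items = len(counts)
    n_categories = len(counts[0])

    # Observed agreement: mean over items of the pairwise agreement within each item.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall share of ratings in each category.
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(s * s for s in shares)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_observed - p_expected) / (1.0 - p_expected)

# Hypothetical example: 3 raters label 2 exams as (essential, not essential).
kappa_perfect = fleiss_kappa([[3, 0], [0, 3]])
kappa_split = fleiss_kappa([[2, 1], [1, 2]])
```

Values near 1 indicate agreement well above chance, while values near or below 0 indicate agreement no better than chance.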

Circularity Check

0 steps flagged

No circularity; benchmark is an independent evaluation framework

The paper presents DDX-TRACE as a new physician-adjudicated benchmark for diagnostic trajectories, with no mathematical derivations, equations, fitted parameters, or predictions that reduce to inputs by construction. No self-citations are invoked as load-bearing for uniqueness theorems or ansatzes. The central claims rest on empirical evaluation of existing VLMs against the benchmark rather than any self-referential reduction. The 211 cases and adjudication protocol are described as external to the models being tested, making the framework self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work relies on the domain assumption that medical diagnosis is inherently sequential and that simulated hidden evidence can validly test real-world capabilities; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: Medical diagnosis is a sequential process of evidence gathering, differential diagnosis updating, and stopping when sufficiently supported.
This underpins the entire benchmark design and evaluation protocol described in the abstract.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 7 internal anchors

[1]

HealthBench: Evaluating Large Language Models Towards Improved Human Health

Rahul K Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero- Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, et al. Healthbench: Evaluating large language models towards improved human health.arXiv preprint arXiv:2505.08775, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Holistic evaluation of large language models for medical tasks with medhelm.Nature Medicine, pages 1–9, 2026

Suhana Bedi, Hejie Cui, Miguel Fuentes, Alyssa Unell, Michael Wornow, Juan M Banda, Nikesh Kotecha, Timothy Keyes, Yifan Mai, Mert Oez, et al. Holistic evaluation of large language models for medical tasks with medhelm.Nature Medicine, pages 1–9, 2026

work page 2026
[3]

Red teaming chatgpt in medicine to yield real-world insights on model behavior.npj Digital Medicine, 8(1):149, 2025

Crystal T Chang, Hodan Farah, Haiwen Gui, Shawheen Justin Rezaei, Charbel Bou-Khalil, Ye-Jean Park, Akshay Swaminathan, Jesutofunmi A Omiye, Akaash Kolluri, Akash Chaurasia, et al. Red teaming chatgpt in medicine to yield real-world insights on model behavior.npj Digital Medicine, 8(1):149, 2025

work page 2025
[4]

HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs

Junying Chen, Zhenyang Cai, Ke Ji, Xidong Wang, Wanlong Liu, Rongsheng Wang, Jianye Hou, and Benyou Wang. Huatuogpt-o1, towards medical complex reasoning with llms.arXiv preprint arXiv:2412.18925, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

Towards injecting medical visual knowledge into multimodal llms at scale

Junying Chen, Chi Gui, Ruyi Ouyang, Anningzhe Gao, Shunian Chen, Guiming Hardy Chen, Xidong Wang, Zhenyang Cai, Ke Ji, Xiang Wan, et al. Towards injecting medical visual knowledge into multimodal llms at scale. InProceedings of the 2024 conference on empirical methods in natural language processing, pages 7346–7370, 2024

work page 2024
[6]

A vision- language foundation model to enhance efficiency of chest x-ray interpretation.arXiv preprint arXiv:2401.12208, 2024

Zhihong Chen, Maya Varma, Justin Xu, Magdalini Paschali, Dave Van Veen, Andrew Johnston, Alaa Youssef, Louis Blankemeier, Christian Bluethgen, Stephan Altmayer, et al. A vision- language foundation model to enhance efficiency of chest x-ray interpretation.arXiv preprint arXiv:2401.12208, 2024

work page arXiv 2024
[7]

Simulating viva voce examinations to evaluate clinical reasoning in large language models.arXiv preprint arXiv:2510.10278, 2025

Christopher Chiu, Silviu Pitis, and Mihaela van der Schaar. Simulating viva voce examinations to evaluate clinical reasoning in large language models.arXiv preprint arXiv:2510.10278, 2025

work page arXiv 2025
[8]

Eurorad: The radiological case database

European Society of Radiology. Eurorad: The radiological case database

work page
[9]

Ai hospital: Benchmarking large language models in a multi-agent medical interaction simulator

Zhihao Fan, Lai Wei, Jialong Tang, Wei Chen, Wang Siyuan, Zhongyu Wei, and Fei Huang. Ai hospital: Benchmarking large language models in a multi-agent medical interaction simulator. InProceedings of the 31st International Conference on Computational Linguistics, pages 10183–10213, 2025

work page 2025
[10]

Gemma 3 technical report, 2025

Gemma Team. Gemma 3 technical report, 2025

work page 2025
[11]

Evalu- ation and mitigation of the limitations of large language models in clinical decision-making

Paul Hager, Friederike Jungmann, Robbie Holland, Kunal Bhagat, Inga Hubrecht, Manuel Knauer, Jakob Vielhauer, Marcus Makowski, Rickmer Braren, Georgios Kaissis, et al. Evalu- ation and mitigation of the limitations of large language models in clinical decision-making. Nature medicine, 30(9):2613–2622, 2024

work page 2024
[12]

Vision-language models for medical report generation and visual question answering: A review.Frontiers in artificial intelligence, 7:1430984, 2024

Iryna Hartsock and Ghulam Rasool. Vision-language models for medical report generation and visual question answering: A review.Frontiers in artificial intelligence, 7:1430984, 2024

work page 2024
[13]

PathVQA: 30000+ Questions for Medical Visual Question Answering

Xuehai He, Yichen Zhang, Luntian Mou, Eric Xing, and Pengtao Xie. Pathvqa: 30000+ questions for medical visual question answering.arXiv preprint arXiv:2003.10286, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2003
[14]

Medagentbench: a virtual ehr environment to benchmark medical llm agents

Yixing Jiang, Kameron C Black, Gloria Geng, Danny Park, James Zou, Andrew Y Ng, and Jonathan H Chen. Medagentbench: a virtual ehr environment to benchmark medical llm agents. Nejm Ai, 2(9):AIdbp2500144, 2025

work page 2025
[15]

What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021

Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021

work page 2021
[16]

Pubmedqa: A dataset for biomedical research question answering

Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. Pubmedqa: A dataset for biomedical research question answering. InProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 2567–2577, 2019. 11

work page 2019
[17]

Medcalc- bench: Evaluating large language models for medical calculations.Advances in Neural Infor- mation Processing Systems, 37:84730–84745, 2024

Nikhil Khandekar, Qiao Jin, Guangzhi Xiong, Soren Dunn, Serina S Applebaum, Zain Anwar, Maame Sarfo-Gyamfi, Conrad W Safranek, Abid A Anwar, Andrew Zhang, et al. Medcalc- bench: Evaluating large language models for medical calculations.Advances in Neural Infor- mation Processing Systems, 37:84730–84745, 2024

work page 2024
[18]

A dataset of clinically generated visual questions and answers about radiology images.Scientific data, 5(1):180251, 2018

Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images.Scientific data, 5(1):180251, 2018

work page 2018
[19]

Llava-med: Training a large language-and-vision assistant for biomedicine in one day.Advances in Neural Information Processing Systems, 36:28541–28564, 2023

Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day.Advances in Neural Information Processing Systems, 36:28541–28564, 2023

work page 2023
[20]

Medical visual question answering: A survey.Artificial Intelligence in Medicine, 143:102611, 2023

Zhihong Lin, Donghao Zhang, Qingyi Tao, Danli Shi, Gholamreza Haffari, Qi Wu, Mingguang He, and Zongyuan Ge. Medical visual question answering: A survey.Artificial Intelligence in Medicine, 143:102611, 2023

work page 2023
[21]

Slake: A semantically- labeled knowledge-enhanced dataset for medical visual question answering

Bo Liu, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu. Slake: A semantically- labeled knowledge-enhanced dataset for medical visual question answering. In2021 IEEE 18th international symposium on biomedical imaging (ISBI), pages 1650–1654. IEEE, 2021

work page 2021
[22]

Sequential diagnosis with language models.arXiv preprint arXiv:2506.22405, 2025

Harsha Nori, Mayank Daswani, Christopher Kelly, Scott Lundberg, Marco Tulio Ribeiro, Marc Wilson, Xiaoxuan Liu, Viknesh Sounderajah, Jonathan Carlson, Matthew P Lungren, et al. Sequential diagnosis with language models.arXiv preprint arXiv:2506.22405, 2025

work page arXiv 2025
[23]

Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering

Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. InConference on health, inference, and learning, pages 248–260. PMLR, 2022

work page 2022
[24]

Beyond benchmarks: Dynamic, automatic and systematic red-teaming agents for trustworthy medical language models.arXiv preprint arXiv:2508.00923, 2025

Jiazhen Pan, Bailiang Jian, Paul Hager, Yundi Zhang, Che Liu, Friedrike Jungmann, Hong- wei Bran Li, Chenyu You, Junde Wu, Jiayuan Zhu, et al. Beyond benchmarks: Dynamic, automatic and systematic red-teaming agents for trustworthy medical language models.arXiv preprint arXiv:2508.00923, 2025

work page arXiv 2025
[25]

Medvlm-r1: Incentivizing medical reasoning capability of vision-language models (vlms) via reinforcement learning

Jiazhen Pan, Che Liu, Junde Wu, Fenglin Liu, Jiayuan Zhu, Hongwei Bran Li, Chen Chen, Cheng Ouyang, and Daniel Rueckert. Medvlm-r1: Incentivizing medical reasoning capability of vision-language models (vlms) via reinforcement learning. InInternational Conference on Medical Image Computing and Computer-Assisted Intervention, pages 337–347. Springer, 2025

work page 2025
[26]

Towards building multilingual language model for medicine

Pengcheng Qiu, Chaoyi Wu, Xiaoman Zhang, Weixiong Lin, Haicheng Wang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Towards building multilingual language model for medicine. Nature Communications, 15(1):8384, 2024

work page 2024
[27]

Qwen3.5-35B-A3B model card

Qwen Team. Qwen3.5-35B-A3B model card. Hugging Face model card, 2026. https: //huggingface.co/Qwen/Qwen3.5-35B-A3B(accessed March 2026)

work page 2026
[28]

Qwen3.5 model collection

Qwen Team. Qwen3.5 model collection. Hugging Face model collection, 2026. https: //huggingface.co/collections/Qwen/qwen35(accessed March 2026)

work page 2026
[29]

Capabilities of Gemini Models in Medicine

Khaled Saab, Tao Tu, Wei-Hung Weng, Ryutaro Tanno, David Stutz, Ellery Wulczyn, Fan Zhang, Tim Strother, Chunjong Park, Elahe Vedadi, et al. Capabilities of gemini models in medicine.arXiv preprint arXiv:2404.18416, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[30]

MedGemma Technical Report

Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, et al. Medgemma technical report.arXiv preprint arXiv:2507.05201, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[31]

Large language models encode clinical knowledge.Nature, 620(7972):172–180, 2023

Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge.Nature, 620(7972):172–180, 2023. 12

work page 2023
[32]

Toward expert-level medical question answering with large language models.Nature Medicine, pages 1–8, 2025

Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen R Pfohl, Heather Cole-Lewis, et al. Toward expert-level medical question answering with large language models.Nature Medicine, pages 1–8, 2025

work page 2025
[33]

Collaboration between clinicians and vision–language models in radiology report generation.Nature Medicine, 31(2):599–608, 2025

Ryutaro Tanno, David GT Barrett, Andrew Sellergren, Sumedh Ghaisas, Sumanth Dathathri, Abigail See, Johannes Welbl, Charles Lau, Tao Tu, Shekoofeh Azizi, et al. Collaboration between clinicians and vision–language models in radiology report generation.Nature Medicine, 31(2):599–608, 2025

work page 2025
[34]

Kevin Wu, Eric Wu, Rahul Thapa, Kevin Wei, Angela Zhang, Arvind Suresh, Jacqueline J Tao, Min Woo Sun, Alejandro Lozano, and James Zou. MedCaseReasoning: Evaluating and learning diagnostic reasoning from clinical case reports. arXiv preprint arXiv:2505.11733, 2025.

[35]

Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, et al. Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning. arXiv preprint arXiv:2506.07044, 2025.

[36]

Zonghai Yao, Benlu Wang, Yifan Zhang, Junda Wang, Iris Xia, Zhipeng Tang, Shuo Han, Feiyun Ouyang, Zhichao Yang, Arman Cohan, and Hong Yu. Medical thinking with multiple images. In The Fourteenth International Conference on Learning Representations, 2026.

[37]

Hongbo Zhang, Junying Chen, Feng Jiang, Fei Yu, Zhihong Chen, Guiming Chen, Jianquan Li, Xiangbo Wu, Zhang Zhiyi, Qingying Xiao, et al. HuatuoGPT, towards taming language model to be a doctor. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10859–10885, 2023.

[38]

Weike Zhao, Chaoyi Wu, Yanjie Fan, Pengcheng Qiu, Xiaoman Zhang, Yuze Sun, Xiao Zhou, Shuju Zhang, Yu Peng, Yanfeng Wang, et al. An agentic system for rare disease diagnosis with traceable reasoning. Nature, pages 1–10, 2026.

[39]

Jiayuan Zhu, Jiazhen Pan, Yuyuan Liu, Fenglin Liu, and Junde Wu. Ask patients with patience: Enabling LLMs for human-centric medical dialogue with grounded reasoning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 2846–2857, 2025.

[40]

Yuxin Zuo, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, and Bowen Zhou. MedXpertQA: Benchmarking expert-level medical reasoning and understanding. arXiv preprint arXiv:2501.18362, 2025.

Appendix
A Pictorial Illustration of DDX-TRACE Workflow
B Limitations
C Benchmark Statistics and Data Distribution
D Extend...

[41]

[Figure 4 panel labels: Confidence Alignment; Hidden Protocol Pool; MRI T1 Mapping; CTA / MRA; MRI FLAIR; Ultrasound; Unlocked studies; Inactivated studies]

Figure 4: DDX-TRACE overview. Each case begins with limited clinical history only, and the agent does not receive a list of available examinations. The agent interacts with the environment by requesting imaging exam...

[42]

Based on the imaging figures provided, what is the most likely diagnosis?

Candidate-model decoding settings are reported separately in Appendix J.1. For each case, the judge receives the case-specific diagnosis and localization rubrics, the model’s final output, and the set of unique diagnosis strings that appeared anywhere in the trajectory. It returns normalized endpoint scores for diagnosis and localization, together with ex...

[43]

Localisation -- where the abnormality is

[44]

Diagnosis -- the most likely diagnosis.

REQUIREMENTS
- Output must be a single JSON object that strictly matches the provided JSON Schema.
- Provide analytic criteria for score levels "3", "2", "1", "0" in EACH of the sections (Localisation / Diagnosis).
- Include a **reference_answer** for EACH section:
  - Localisation.reference_answer must be an object w...

[45]

final diagnosis quality using the provided case-specific diagnosis rubric

[46]

final localization quality using the provided case-specific localization rubric

[47]

final four-item differential-list quality using the reference differential set and the global rubric below

[48]

final_scores

exact/acceptable/unmatched labels and 0-3 diagnosis-rubric scores for every diagnosis string in the trajectory.

Global rubric for final differential-list quality (0-3):
- 0: The list is mostly off-target, fails to include the final diagnosis or close equivalent, and has little overlap with the reference differential set.
- 1: The list contains one or more...
